ABSTRACT


The  purpose  of  this  article  is  to  create  a  general  framework  for  the  study  of  parallel 
algorithms.  A  taxonomy  of  parallel  algorithms,  based  on  their  relations  to  parallel  computer 
architectures,  is  introduced.  Examples  of  parallel  algorithms  for  many  architectures  are  given; 
they include algorithms for SIMD array processors, for MIMD multiprocessors, and for direct
chip  implementations.  By  presenting  these  algorithms  in  a  single  place,  issues  and  techniques 
in  designing  algorithms  for  various  types  of  parallel  architectures  are  discussed  and 
compared. 

1. Introduction

There is a large body of literature on parallel algorithms. Parallel algorithms have been
studied since the early sixties (see the survey [Miranker 71]), although at that time no
parallel computers had been constructed. It has always been the case that many researchers
find designing parallel algorithms fascinating and challenging, regardless of whether or not
their algorithms will be used in practice. Interest in parallel algorithms has increased with
the emergence of large scale parallel computers in the past decade. As a result, a
variety of algorithms have been designed for various parallel computer architectures. For
surveys of parallel architectures and parallel algorithms see [Anderson and Jensen 75, Stone
75, Kung 76, Enslow 77, Kuck 77, Ramamoorthy and Li 77, Sameh 77, Heller 78, Kuck 78]. The
recent advent of large scale integration technology has further stimulated interest in parallel
algorithms. Algorithms have been designed for direct chip implementation (see, e.g., [Kung
79]). Hence a vast number of parallel algorithms are known today, designed from many
different viewpoints.

This article presents many examples of parallel algorithms and studies them under a
uniform  framework.  In  Section  2  we  identify  three  important  attributes  of  a  parallel  algorithm 
and  classify  parallel  algorithms  in  terms  of  these  attributes.  Our  classification  of  parallel 
algorithms  corresponds  naturally  to  that  of  parallel  architectures.  Algorithms  for  synchronous 
parallel  computers  are  considered  in  Section  3,  where  examples  of  algorithms  using  various 
communication  geometries  are  presented.  Section  4  considers  algorithms  for  asynchronous 
parallel  computers.  In  that  section,  we  discuss  a  number  of  techniques  to  deal  with  the 
difficulties arising from the asynchronous behavior of computation, and our examples are
mainly  drawn  from  results  in  concurrent  database  systems.  Section  5  contains  some 
concluding  remarks. 

The  author  hopes  that  by  presenting  parallel  algorithms  of  many  different  types  in  a  single 
place,  this  article  can  be  useful  to  readers  who  wish  to  understand  the  basic  issues  and 
techniques  in  designing  parallel  algorithms  for  various  architectures.  The  article  can  be 
useful  as  well  to  readers  who  wish  to  know  what  parallel  algorithms  are  available,  in  order  to 
decide  on  the  best  way  to  design  or  choose  a  parallel  architecture. 

2. The Space of Parallel Algorithms: Taxonomy and Relation to
Parallel Architectures

2.1  Introduction 

We  view  a  parallel  algorithm  as  a  collection  of  independent  task  modules  which  can  be 
executed  in  parallel  and  which  communicate  with  each  other  during  the  execution  of  the 
algorithm.  In  Section  2.2,  we  identify  three  orthogonal  dimensions  of  the  space  of  parallel 
algorithms:  concurrency  control,  module  granularity,  and  communication  geometry.  Along 
each  dimension,  we  illustrate  some  important  positions  that  parallel  algorithms  can  assume, 
but  no  attempt  will  be  made  to  list  all  possible  positions.  In  Section  2.3,  we  characterize 
parallel  algorithms  that  correspond  to  three  important  parallel  architectures  along  the 
concurrency  control  and  module  granularity  dimensions.  In  Section  2.4,  this  characterization 
together with the third dimension -- communication geometry -- forms a taxonomy for
parallel  algorithms.  Our  taxonomy  is  crude  and  is  by  no  means  meant  to  be  complete.  The 
main  purpose  of  introducing  it  here  is  to  provide  a  framework  for  later  discussions  in  this 
paper. We hope that future work on the taxonomy will make it possible to unambiguously
classify parallel algorithms at a conceptual level, and to relate each parallel algorithm to those
parallel  architectures  to  which  it  naturally  corresponds. 


2.2  The  Three  Dimensions  of  the  Space  of  Parallel  Algorithms 

Concurrency  Control 

In  a  parallel  algorithm,  because  more  than  one  task  module  can  be  executed  at  a  time, 
concurrency control is needed to ensure the correctness of the concurrent execution. The
concurrency  control  enforces  desired  interactions  among  task  modules  so  that  the  overall 
execution  of  the  parallel  algorithm  will  be  correct.  The  leaves  of  the  tree  in  Fig.  2-1 
represent  the  space  of  concurrency  controls  which  can  be  used  in  parallel  algorithms.  For 
example, the leftmost leaf represents the concurrency control of an algorithm whose task
modules execute in lock-step the same code broadcast by the central control, while the
second leftmost leaf represents a synchronous distributed control achieved by simple local
control  mechanisms. 

Module  Granularity 

The  module  granularity  of  a  parallel  algorithm  refers  to  the  maximal  amount  of  computation 
a  typical  task  module  can  do  before  having  to  communicate  with  other  modules.  The  module 
granularity  of  a  parallel  algorithm  reflects  whether  or  not  the  algorithm  tends  to  be 
communication intensive. For example, a parallel algorithm with a small module granularity
will require frequent intermodule communication. In this case, for efficiency reasons it may
be desirable to provide proper data paths in hardware to facilitate the communication. For
the purpose of this paper, we shall classify module granularities of parallel algorithms into
only three groups. See Fig. 2-2.

[Figure 2-1 (tree diagram). Node labels: CONCURRENCY CONTROL; CENTRALIZED CONTROL,
DISTRIBUTED CONTROL, CONTROL VIA SHARED DATA; SYNCHRONOUS, ASYNCHRONOUS; SIMPLE LOCAL
CONTROL, COMPLEX LOCAL CONTROL.]

Figure 2-1: A classification of concurrency controls of parallel algorithms - leaves of the
tree representing various types of concurrency controls.

Figure 2-2: A classification of module granularities of parallel algorithms.

Communication Geometry

Suppose that task modules of a parallel algorithm are connected to represent intermodule
communication.  Then  a  geometric  layout  of  the  resulting  network  is  referred  to  as  the 
communication  geometry  of  the  algorithm.  The  leaves  of  the  tree  in  Fig.  2-3  represent  the 
space of communication geometries. For example, leaf HEXAGONAL represents communication
geometries  that  correspond  to  regular  2-dimensional  hexagonal  arrays  (see  Fig.  3-9  (b)). 

2.3  Matching  Parallel  Algorithms  with  Parallel  Architectures 

It is straightforward to assess the matching between parallel algorithms and
parallel architectures along the communication geometry dimension. Here we discuss the less
obvious matching along the other two dimensions: concurrency control and module
granularity. We consider three architectures and their matching algorithms that are relevant
to our discussions in Sections 3 and 4.

Figure 2-3: A classification of communication geometries of parallel algorithms - leaves of the
tree representing various types of communication structures.

SIMD and MIMD Machines and Algorithms

The notion of single-instruction stream multiple-data stream (SIMD) and multiple-instruction
stream multiple-data stream (MIMD) parallel computers [Flynn 66] is often used in the
literature for classifying parallel computers. With a SIMD machine such as ILLIAC IV [Barnes
et al. 68], one stream of instructions issued by the central control unit controls all the
processors, each operating upon its own memory synchronously. With a MIMD machine such
as C.mmp [Wulf and Bell 72], Cm* [Fuller et al. 77], or Pluribus [Heart et al. 73], the
processors have independent instruction counters, and operate asynchronously on shared
memories. SIMD machines correspond to synchronous lock-step algorithms that require
central controls, whereas MIMD machines correspond to asynchronous algorithms with
relatively large granularities [Kung 76]. Algorithms that match with SIMD and MIMD machines
are called SIMD and MIMD algorithms, respectively. See Fig. 2-4.

Systolic  Machines  and  Algorithms 

Developments  in  microelectronics  have  revolutionized  computer  design.  Large  Scale 
Integration  (LSI)  technology  has  increased  the  number  and  complexity  of  components  that  can 
fit  on  a  chip.  In  fact,  component  density  has  been  doubling  every  one-to-two  years  for  more 
than  a  decade.  Today  a  single  chip  can  contain  hundreds  of  thousands  of  devices.  As  a 
result,  machines-on-a-chip  have  emerged;  these  machines  can  be  used  as  special  purpose 
devices  attached  to  a  conventional  computer.  "Systolic  machines"  represent  one  class  of  such 
machines that have regular structures. Intuitively a systolic machine is a network of simple
and primitive processors that circulate data in a regular fashion [Kung and Leiserson 79].
The word "systole" was borrowed from physiologists who use it to refer to the rhythmically
recurrent contractions of the heart and arteries which pulse blood through the body. For a
systolic machine, the function of a processor is analogous to that of the heart. Each
processor regularly pumps data in and out, each time performing some short computation, so
that a regular flow of data is kept up in the network. At every processor the control for
communication and computation is very simple, and the storage space is only a small constant,
independent of the size of the network. For a low cost and high performance chip
implementation, it is crucial that the geometry of the communication paths in a systolic
machine be simple and regular. The geometric problem will be treated in detail in Section 3.
Systolic machines correspond to synchronous algorithms that use distributed control achieved
by simple local control mechanisms and that have (small) constant module granularities.
Algorithms that match with systolic machines are called systolic algorithms. See Fig. 2-4.

Figure 2-4: Characterizations of parallel algorithms that match with systolic, SIMD, and MIMD
machines, along the concurrency control and module granularity dimensions.

2.4 A Taxonomy for Parallel Algorithms

Let {CONCURRENCY CONTROLS}, {MODULE GRANULARITIES}, and {COMMUNICATION GEOMETRIES} be the
sets of leaves in Figs. 2-1, 2-2, and 2-3, respectively. Then the cross product {CONCURRENCY
CONTROLS} × {MODULE GRANULARITIES} × {COMMUNICATION GEOMETRIES} represents the space of parallel
algorithms. One could give a taxonomy for parallel algorithms which classifies algorithms in
terms of their positions in this three-dimensional space, but the space is large and
contains quite a few uninteresting cases. We therefore restrict ourselves to a small subspace
which nevertheless contains, we believe, most of the interesting and significant parallel
algorithms. This subspace is the cross product {SYSTOLIC, SIMD,
MIMD} × {COMMUNICATION GEOMETRIES}, where SYSTOLIC, SIMD, and MIMD are three particular positions
in the space {CONCURRENCY CONTROLS} × {MODULE GRANULARITIES} that represent systolic, SIMD, and
MIMD algorithms, respectively (cf. Fig. 2-4).

We name algorithms in {SYSTOLIC, SIMD, MIMD} × {COMMUNICATION GEOMETRIES} in a natural way.
For  example,  an  algorithm  is  called  a  systolic  algorithm  using  a  hexagonal  array,  if  it  is 
systolic  and  its  communication  geometry  is  a  hexagonal  array. 

Generally speaking, among the three types of algorithms (SYSTOLIC, SIMD and MIMD), systolic
algorithms  are  most  structured  and  MIMD  algorithms  are  least  structured.  For  a  systolic 
algorithm,  task  modules  are  simple  and  interactions  among  them  are  frequent.  The  situation  is 
reversed  for  MIMD  algorithms.  Systolic  algorithms  are  designed  for  direct  hardware 
implementations,  while  MIMD  algorithms  are  designed  for  executions  on  general  purpose 
multiprocessors.  SIMD  algorithms  may  be  seen  as  lying  between  the  other  two  types  of 
algorithms.  Using  the  central  control,  SIMD  algorithms  can  broadcast  parameters  and  handle 
exceptions  rather  easily.  These  reasons  make  SIMD  algorithms  attractive  in  some  cases. 

In summary, along the concurrency control and module granularity dimensions we have
classified parallel algorithms into three classes: SYSTOLIC, SIMD, and MIMD. Each class of
algorithms can further adopt various communication geometries. Figure 2-5 presents
examples in the space {SYSTOLIC, SIMD, MIMD} × {COMMUNICATION GEOMETRIES}. Most of these
parallel algorithms will be discussed in the rest of the paper. Systolic and SIMD algorithms
will be treated in Section 3, whereas MIMD algorithms will be studied in Section 4.


SYSTOLIC ALGORITHMS USING

  1-DIM LINEAR ARRAYS
      REAL-TIME FIR-FILTERING, DISCRETE FOURIER TRANSFORM (DFT),
      CONVOLUTION, MATRIX-VECTOR MULTIPLICATION, RECURRENCE EVALUATION,
      SOLUTION OF TRIANGULAR LINEAR SYSTEMS, CARRY PIPELINING,
      SORTING, PRIORITY QUEUE, CARTESIAN PRODUCT, PIPELINE ARITHMETIC UNITS

  2-DIM SQUARE ARRAYS
      PATTERN MATCHING, GRAPH ALGORITHMS INVOLVING ADJACENCY MATRICES,
      DYNAMIC PROGRAMMING FOR OPTIMAL PARENTHESIZATION

  2-DIM HEXAGONAL ARRAYS
      MATRIX PROBLEMS (MATRIX MULTIPLICATION, LU-DECOMPOSITION BY
      GAUSSIAN ELIMINATION WITHOUT PIVOTING, QR-FACTORIZATION),
      TRANSITIVE CLOSURE, DFT, RELATIONAL DATABASE OPERATIONS

  TREES
      SEARCHING ALGORITHMS (QUERIES ON NEAREST NEIGHBOR, RANK, ETC.,
      SYSTOLIC SEARCH TREE), PARALLEL FUNCTION EVALUATION,
      RECURRENCE EVALUATION

  SHUFFLE-EXCHANGE
      FAST FOURIER TRANSFORM, BITONIC SORT

SIMD ALGORITHMS
      NUMERICAL RELAXATION FOR PARTIAL DIFFERENTIAL EQUATIONS OR
      IMAGE PROCESSING, GAUSSIAN ELIMINATION WITH PIVOTING, MERGE SORT.
      (IN GENERAL, CORRESPONDING TO EACH SYSTOLIC ALGORITHM THERE IS A SIMD
      ALGORITHM CONSISTING OF TASK MODULES WITH LARGER GRANULARITIES.)

MIMD ALGORITHMS
      CONCURRENT DATABASE ALGORITHMS (CONCURRENT ACCESSES TO
      B-TREES OR BINARY SEARCH TREES, CONCURRENT
      DATABASE REORGANIZATION - GARBAGE COLLECTION), CHAOTIC
      RELAXATION, DYNAMIC SCHEDULING ALGORITHMS, ALGORITHMS WITH
      LARGE MODULE GRANULARITIES

Figure 2-5: Examples in the parallel algorithm space.

3. Algorithms for Synchronous Parallel Computers

3.1  Introduction 

We consider in this section parallel algorithms for synchronous parallel computers, which
include systolic and SIMD machines described in Section 2. These algorithms will be classified,
to first order at least, according to their communication geometries. Results in this section
should provide useful insights into the problem of selecting interconnection networks for
systolic or SIMD machines.

As  mentioned  in  Section  2.3,  the  existence  of  a  cost  effective  chip  implementation  of  a 
systolic  algorithm  in  LSI  technology  depends  crucially  on  the  communication  geometry  of  the 
algorithm.  It  is  highly  desirable  that  communication  geometries  be  simple  and  regular.  Such 
structures  lead  to  cheap  implementations  and  high  densities.  In  turn,  high  density  implies 
both  high  performance  and  low  overhead  for  support  components.  For  more  discussions  on 
this  matter,  see  [Sutherland  and  Mead  77,  Foster  and  Kung  79].  In  this  section,  special 
attention  will  be  paid  to  those  structures  which  are  simple  and  regular. 

One  of  the  main  concerns  in  the  design  and  verification  of  synchronous  algorithms  defined 
on  networks  is  to  ensure  that  required  data  items  will  reach  the  right  places  at  the  right 
times  to  interact  with  each  other.  For  this  reason,  we  shall  often  illustrate  algorithms  by 
their  data  flow  diagrams.  For  systolic  machines,  further  attention  is  needed  to  ensure  that 
the  execution  of  a  task  module  requires  only  a  small  constant  amount  of  time  and  space.  We 
assume  throughout  the  section  that  it  takes  a  unit  of  time  to  send  a  unit  of  data  from  a 
processor to any of its topological neighbors. (See discussions in Section 3.4 for the
rationale of this assumption for a case involving wires of different lengths.) Under this
assumption, we shall show that many problems which require nonlinear (e.g., O(n log n), O(n^2),
or O(n^3)) times on uniprocessors can be solved in linear times on systolic machines with
enough  processors.  Algorithms  for  systolic  machines  can  run  on  corresponding  SIMD  machines 
with  similar  underlying  interconnection  structures  without  losing  efficiency,  but  not  vice 
versa.  The  unique  capabilities  of  SIMD  machines  for  broadcasting  data  and  instruction  codes, 
and  for  storing  a  relatively  large  amount  of  data  local  to  each  processor  can  be  crucial  to  the 
efficiency  of  some  algorithms.  Algorithms  presented  in  this  section  are  in  general  suitable  for 
systolic  machines,  unless  stated  otherwise. 

3.2 Algorithms Using One-dimensional Linear Arrays

One-dimensional  linear  arrays  (Fig.  3-1)  represent  the  simplest  and  also  the  most 
fundamental geometry for connecting processors. Shift registers can implement linear arrays
directly.  Surprisingly  enough,  for  a  large  number  of  important  algorithms  this  simple 
structure  is  all  that  is  needed  for  communication. 


[Diagram: four linearly connected processors, numbered 0, 1, 2, 3 from left to right.]


Figure 3-1: A one-dimensional linear array.

In  the  following,  we  give  four  algorithms  using  linear  arrays.  The  first  algorithm 
concerning  odd-even  transposition  sort  is  perhaps  the  most  well-known  algorithm  using  a 
linear  processor  array  (see,  for  example,  [Knuth  73]  and  [Mukhopadhyay  and  Ichikawa  72]). 
The  latter  three  algorithms  demonstrate  an  important  way  of  using  linear  arrays.  That  is,  a 
linear  array  can  be  viewed  as  a  pipe  and  thus  is  natural  for  pipeline  computations. 
Depending  on  the  algorithm,  data  may  flow  in  only  one  direction  or  in  both  directions 
simultaneously.  We  show  that  two-way  pipelining  is  a  simple  and  powerful  construct  for 
realizing complex computations. Following the discussions of the four algorithms, we mention
the  use  of  linear  pipelines  in  the  implementation  of  arithmetic  operations. 

For  ease  in  describing  these  algorithms,  we  shall  number  the  processors  from  left  to  right 
by integers 0, 1, ..., as in Fig. 3-1.

Odd-Even  Transposition  Sort 

Given  n  keys  stored  in  a  linear  array  of  processors,  one  key  in  each  processor,  the 
problem  is  to  sort  them  in  ascending  order.  The  problem  can  be  solved  in  n  steps  by  using 
the  odd-even  transposition  sort.  Odd  and  even  numbered  processors  are  activated 
alternately.  Assume  that  the  even  numbered  processors  are  activated  first.  In  each  cycle, 
the  following  comparison-exchange  operations  take  place:  the  key  in  every  activated 
processor  is  first  compared  with  the  key  in  its  right  hand  neighboring  processor,  and  then 
the  smaller  one  is  stored  in  the  activated  processor.  Within  n  cycles,  the  keys  will  be  sorted 
in  the  array  (see  Fig.  3-2). 
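
The cycle structure described above can be sketched as a sequential simulation. This is only an illustration under stated assumptions: the function name `odd_even_transposition_sort` is hypothetical, the list index plays the role of the processor number, and the inner loop stands in for comparison-exchanges that the array would perform simultaneously.

```python
def odd_even_transposition_sort(keys):
    """Simulate n cycles of odd-even transposition sort on a linear array.

    Position i in the list stands for processor i. In each cycle the
    activated processors (even-numbered first, then odd-numbered, and so on
    alternately) compare their key with the right-hand neighbor's key and
    keep the smaller one; the neighbor keeps the larger one.
    """
    a = list(keys)
    n = len(a)
    for cycle in range(n):
        first_active = cycle % 2          # cycle 0 activates processors 0, 2, 4, ...
        for i in range(first_active, n - 1, 2):
            # On the real array these exchanges happen in parallel; they touch
            # disjoint pairs, so simulating them one by one gives the same result.
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a


if __name__ == "__main__":
    print(odd_even_transposition_sort([5, 1, 4, 2, 3]))
```

Because the pairs handled in one cycle are disjoint, each cycle costs one time unit on the array, which is where the n-step bound comes from.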

Figure 3-2: The odd-even transposition sort on a one-dimensional linear array.

The idea generalizes directly to the case where each processor holds a sorted
subsequence of keys rather than a single key [Baudet and Stevenson 78]. For this case, the
comparison-exchange operation becomes the merge-splitting operation. Using this
generalization, one can sort n keys on k linearly connected processors in
O((n/k) log(n/k)) + O(k(n/k)) time, provided that each processor can hold n/k keys (this is
possible for SIMD machines). In the above expression, the first term is the time to sort an
(n/k)-subsequence at each processor, and the second term is the time to perform the
odd-even transposition sort on k sorted (n/k)-subsequences. It is readily seen that when n is
large relative to k, a speed-up ratio near k is obtained. This near optimal speed-up (with
respect to the number of processors used) is due to the fact that when n is large relative to
k, the computation done within each processor is large, as compared to interprocessor
communication. Thus, the overheads arising from interprocessor communication become
relatively insignificant.
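
A minimal sketch of the merge-splitting generalization follows, assuming for simplicity that k divides n so that every processor holds exactly n/k keys. The function name `merge_split_sort` and the use of `heapq.merge` for the local merge are choices made for this illustration, not details taken from [Baudet and Stevenson 78].

```python
import heapq


def merge_split_sort(keys, k):
    """Sort n keys on k simulated processors holding n/k keys each.

    Phase 1: every processor sorts its own block, O((n/k) log(n/k)) each.
    Phase 2: k cycles of odd-even transposition, where comparison-exchange
    is replaced by merge-split: two neighboring sorted blocks are merged,
    the left processor keeps the lower half and the right one the upper half.
    """
    n = len(keys)
    if n % k != 0:
        raise ValueError("this sketch assumes k divides n")
    m = n // k
    blocks = [sorted(keys[i * m:(i + 1) * m]) for i in range(k)]
    for cycle in range(k):
        for i in range(cycle % 2, k - 1, 2):
            merged = list(heapq.merge(blocks[i], blocks[i + 1]))  # O(n/k) work
            blocks[i], blocks[i + 1] = merged[:m], merged[m:]
    return [key for block in blocks for key in block]


if __name__ == "__main__":
    print(merge_split_sort([9, 3, 7, 1, 8, 2, 6, 4, 5, 0, 11, 10], 4))
```

Each of the k cycles costs O(n/k) per processor, which accounts for the second term O(k(n/k)) in the running time quoted above.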

Real-Time  Finite  Impulse  Response  (FIR)  Filtering 

One  of  the  most  frequently  performed  computations  in  signal  processing  is  that  of  a  FIR 
filter.  The  computation  of  a  p-tap  FIR  filter  can  be  viewed  as  a  matrix-vector  multiplication 
where  the  matrix  is  a  band  upper  triangular  Toeplitz  matrix  with  band  width  p.  Figure  3-3 
represents  the  computation  of  a  4-tap  filter. 

Figure 3-3: The computation of a 4-tap FIR filter with coefficients a_1, a_2, a_3, and a_4.


In the figure, the sequence x_1, x_2, x_3, ... corresponds to a real-time data stream obtained
by sampling the signal at times t, t + δ, t + 2δ, ..., and the constants a_1, a_2, a_3, and a_4 are the
taps of the filter. A p-tap filter can be implemented efficiently by a linear array consisting of
p inner product step processors, each capable of performing one multiplication and one addition
in a unit of time. We illustrate the operation of the linear array by considering the filtering
problem in Fig. 3-3. The taps a_i are stored in the array at the beginning of the computation,
one in each processor, and they do not move during the computation (cf. Fig. 3-4). The y_i,
which are initially zero, march to the left, while the x_i march to the right. All the
moves are synchronized, and the x_i's and y_i's are separated by two time units. It is readily
seen that each y_i is able to accumulate all its terms, namely, a_1 x_i, a_2 x_{i+1}, a_3 x_{i+2}, and a_4 x_{i+3},
before it leaves the array at the left end processor. Therefore the y_i's are computed in
real time in the sense that they are output at the same rate as the x_i's are input.


processor:    0      1      2      3
tap (Ra):    a_4    a_3    a_2    a_1

Figure 3-4: The one-dimensional linear array for the computation of the 4-tap
filtering in Fig. 3-3.


We now specify the operation of the linear array more precisely. Each processor has
three registers, Ra, Rx and Ry, which hold a, x, and y values, respectively. Initially, all Rx and
Ry registers contain zeros, and the Ra register at processor i contains the value of a_{4-i}.
Each step of the array consists of the following operations, but for odd numbered steps only
even numbered processors are activated and for even numbered steps only odd numbered
processors are activated.

1. Shift.

- Rx gets the contents of register Rx from the left neighboring processor.
(The Rx in processor 0 gets a new component of x.)

- Ry gets the contents of register Ry from the right neighboring processor.
(Processor 0 outputs its Ry contents and the Ry in processor 3 gets zero.)

2. Multiply and Add.

Ry ← Ry + Ra × Rx.

After p units of time final results of the y_i's are pumped out from the left end processor at
the rate of one output every two units of time. Fig. 3-5 illustrates four steps of the linear
array. Observe that when y_1 is ready to get out from the left end processor at the end of
the seventh step, y_1 = a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4 and y_2 = a_1 x_2 + a_2 x_3.

Figure 3-5: Four steps of the linear array in Fig. 3-4.
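
The shift and multiply-and-add steps specified above can be traced with a small step-by-step simulation. This is a sketch under stated assumptions: the names `systolic_fir` and `direct_fir` are hypothetical, zeros are fed into processor 0 once the input stream is exhausted, and the first p values leaving processor 0 are treated as start-up output (partial sums formed before y_1 entered the array) and discarded.

```python
def systolic_fir(taps, xs):
    """Simulate the p-processor linear array for a p-tap FIR filter.

    taps = [a_1, ..., a_p]; processor i holds a_{p-i} in Ra.
    Returns y_1, y_2, ... with y_i = a_1 x_i + a_2 x_{i+1} + ... + a_p x_{i+p-1}.
    """
    p, n = len(taps), len(xs)
    ra = taps[::-1]                 # ra[i] = a_{p-i}
    rx = [0] * p
    ry = [0] * p
    stream = iter(xs)
    outputs = []
    for step in range(1, 2 * n + 2):
        new_rx, new_ry = rx[:], ry[:]
        for i in range(p):
            if (step + i) % 2 == 0:
                continue            # odd steps: even processors; even steps: odd processors
            # 1. Shift (neighbors are inactive this step, so old values are read).
            if i == 0:
                outputs.append(ry[0])            # processor 0 outputs its Ry
                new_rx[0] = next(stream, 0)      # and takes a new component of x
            else:
                new_rx[i] = rx[i - 1]
            new_ry[i] = ry[i + 1] if i < p - 1 else 0   # last processor gets zero
            # 2. Multiply and add.
            new_ry[i] += ra[i] * new_rx[i]
        rx, ry = new_rx, new_ry
    return outputs[p:]              # drop the p start-up outputs


def direct_fir(taps, xs):
    """Reference computation of the same outputs, straight from the definition."""
    p = len(taps)
    return [sum(taps[k] * xs[i + k] for k in range(p))
            for i in range(len(xs) - p + 1)]


if __name__ == "__main__":
    print(systolic_fir([1, 2, 3, 4], [1, 2, 3, 4, 5, 6]))
```

With four taps, the simulated array completes y_1 in processor 0 at the end of the seventh step, in agreement with the observation in the text, and one value leaves processor 0 every two steps thereafter.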

The FIR result mentioned here is a special case of a result in [Kung and Leiserson 79]
concerning linear processor arrays for general matrix-vector multiplications. Similar results
hold for the computation of convolutions or discrete Fourier transforms. In general, if A is an
n x n matrix of band width w, then a linear array of w processors can multiply A with any
n-vector in O(n) time, as compared to O(wn) time needed for a sequential algorithm on a
uniprocessor computer.

Priority  Queue 

A data structure that can process INSERT, DELETE, and EXTRACT_MIN operations is called a
priority queue. Priority queues are basic structures used in many programming tasks. If a
priority queue is implemented by some balanced tree, for example a 2-3 tree, then an
operation of the queue will typically take O(log n) time when there are n elements stored in
the tree [Aho et al. 75]. This O(log n) delay can be replaced with a constant delay if a linear
array of processors is used to implement the priority queue. Here we shall only sketch the
basic idea behind the linear array implementation. A complete description will be reported
elsewhere.

To  visualize  the  algorithm,  we  assume  that  the  linear  array  in  Fig.  3-1  has  been  physically 
rotated  90  degrees  and  that  processors  are  capable  of  performing  comparison-exchange 
operations  on  elements  in  neighboring  processors.  We  try  to  maintain  elements  in  the  array 
in  the  sorted  order  according  to  their  weights.  After  an  element  is  inserted  into  the  array 
from  the  top,  it  will  "sink  down"  to  the  proper  place  by  trading  positions  with  elements 
having  smaller  weights  (so  lighter  elements  will  "bubble  up").  For  deleting  an  element,  we 
insert an "anti-element" which first sinks down from the top to find the element, and then
annihilates  it.  Elements  below  can  then  bubble  up  into  the  empty  processor.  Hence  the 
element  with  the  smallest  weight  will  always  appear  at  the  top  of  the  processor  array,  and  is 
ready  to  be  extracted  in  constant  time.  An  important  observation  is  that  "sinking  down"  or 
"bubbling  up"  operations  can  be  carried  out  concurrently  at  various  processors  throughout 
the  array.  For  example,  the  second  insertion  can  start  right  after  the  first  insertion  has 
passed the top processor. In this way, any sequence of n INSERT, DELETE, or EXTRACT_MIN
operations can be done in O(n) time on a linear array of n processors, rather than O(n log n)
time as required by a uniprocessor. In particular, by performing n INSERT operations followed
by n EXTRACT_MIN operations the array can sort n elements in O(n) time, where the sorting
time is completely overlapped with input and output. A similar result on sorting was recently
proposed by [Chen et al. 78]. They do not, however, consider the deletion operation.
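
Since only the basic idea is sketched here, the following simulation is necessarily an illustration rather than the implementation referred to above. It models only the pipelined INSERT operation: each cell keeps the lighter of its held element and the arriving one and passes the heavier one down. The anti-element DELETE and the bubbling-up after EXTRACT_MIN are not modelled, and the name `pipelined_insert` and the per-cell `held`/`carry` registers are assumptions of this sketch.

```python
from collections import deque


def pipelined_insert(values):
    """Insert the values one per cycle at the top of an n-cell linear array.

    Returns (held, cycles): the final cell contents from top to bottom and
    the number of cycles until every element has settled.
    """
    n = len(values)
    held = [None] * n           # element resident in each cell
    carry = [None] * n          # element arriving at each cell in this cycle
    pending = deque(values)
    cycles = 0
    while pending or any(c is not None for c in carry):
        if pending:
            carry[0] = pending.popleft()      # a new insertion enters at the top
        next_carry = [None] * n
        for i in range(n):                    # all cells act concurrently
            c = carry[i]
            if c is None:
                continue
            if held[i] is None:
                held[i] = c                   # the element has found its place
            else:
                lighter, heavier = min(held[i], c), max(held[i], c)
                held[i] = lighter             # lighter element stays nearer the top
                next_carry[i + 1] = heavier   # heavier element sinks one cell
        carry = next_carry
        cycles += 1
    return held, cycles


if __name__ == "__main__":
    print(pipelined_insert([4, 1, 3, 2]))
```

In this model a new insertion starts every cycle while earlier ones are still sinking, so n insertions settle in a number of cycles proportional to n, and the lightest element inserted so far is always held by the top cell.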

Recurrence  Evaluation  (Recursive  Filtering) 

Many computational tasks such as recursive digital filtering are concerned with evaluations
of recurrences. A k-th order recurrence problem is defined as follows: Given

x_0, x_{-1}, ..., x_{-k+1}, compute x_1, x_2, ..., defined by

x_i = R_i(x_{i-1}, ..., x_{i-k}) for i > 0,

where the R_i's are given "recurrence functions". For a large class of recurrence functions, a
k-th order recurrence problem can be solved in real-time on k linearly connected processors
[Kung 79]. That is, a new x_i is output at regular time intervals, at a frequency independent of
k. To illustrate the idea, we consider the following linear recurrence:

x_i = a x_{i-1} + b x_{i-2} + c x_{i-3} + d,

where  the  a,  b,  c  and  d  are  constants.  Clearly  feedback  links  are  needed  for  evaluating  such 
a  recurrence  on  a  linear  array,  since  every  newly  computed  term  has  to  be  used  later  for 
computing  other  terms.  The  classical  network  with  feedback  loops  is  depicted  in  Fig.  3-6. 

Figure 3-6: A linear array with feedback loops

Each processor (except the right-most one, which has more than one output port) is the
inner product step processor similar to the one used before for FIR filtering. The x_i,
initialized as d, gets c x_{i-3}, b x_{i-2} and a x_{i-1} at time 1, 2 and 3, respectively. At time 4, the
final value of x_i is output from the right-most processor, and is also fed back to all the
processors for use in computing x_{i+1}, x_{i+2} and x_{i+3}. The feedback loops in Fig. 3-6 are
undesirable, since they make the network irregular and non-modular. Fortunately, these
irregular feedback loops can be replaced with a regular, two-way data flow scheme. Assume
that  each  processor  is  capable  of  performing  the  inner  product  step  and  also  passes  data  as 
depicted  in  Fig.  3-7  (b).  A  two-way  pipeline  algorithm,  without  irregular  feedback  loops,  for 
evaluating  the  linear  recurrence  is  schematized  in  Fig.  3-7  (a).  The  additional  processor, 
drawn in dotted lines, passes data only and is essentially a delay. Each x_i enters the right-most
processor with value zero, accumulates its terms as it marches to the left, and feeds back
its final value to the array through the left-most processor for use in computing x_{i+1}, x_{i+2}
and x_{i+3}. The final values of the x_i's are output from the right-most processor at the rate of
one output every two units of time.
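
The arithmetic carried out by the array can be sketched sequentially. The Python fragment below is only an illustration under stated assumptions: it follows the accumulation order described for the feedback network (start from d, then add the c, b and a terms), and it does not model the timing or the data movement of the two-way pipeline. The function and argument names are hypothetical.

```python
def linear_recurrence(a, b, c, d, x0, xm1, xm2, count):
    """Return [x_1, ..., x_count] for x_i = a*x_{i-1} + b*x_{i-2} + c*x_{i-3} + d.

    x0, xm1 and xm2 are the initial values x_0, x_{-1} and x_{-2}.
    """
    hist = [xm2, xm1, x0]                # x_{i-3}, x_{i-2}, x_{i-1}
    out = []
    for _ in range(count):
        acc = d                          # the accumulator starts as d
        for coeff, past in zip((c, b, a), hist):
            acc += coeff * past          # one inner product step per processor
        out.append(acc)
        hist = hist[1:] + [acc]          # the final value is fed back
    return out
```

With a = b = 1 and c = d = 0 the recurrence reduces to the Fibonacci rule, which gives a quick sanity check.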

This  example  shows  that  two-way  pipelining  is  a  powerful  construct  in  the  sense  that  it 
can eliminate undesirable feedback loops such as those encountered in Fig. 3-6. Extensions of the
two-way  pipelining  approach  to  more  general  recurrence  problems  are  considered  in  [Kung 
79].  Basically  the  two-way  pipelining  idea  is  as  follows:  By  having  two  data  streams  travel 
in  opposite  directions,  a  data  item  in  one  stream  can  meet  all  data  items  in  the  other  stream 
and  thus  their  Cartesian  product  can  be  formed  in  parallel  in  all  stages  of  the  pipe.  Since 
Cartesian product-like computations are common in many applications, we expect to find more
use  of  two-way  pipelining  in  the  future. 

Figure 3-7: (a) A two-way pipeline algorithm without irregular feedback loops, and
(b) the inner product step processor.

Pipeline Processing of Arithmetic Operations

One of the most successful applications of pipeline processing has been in the execution of
arithmetic operations. Pipeline algorithms for floating-point addition, multiplication, division,
and square root have been discussed and reviewed in [Chen 75, Ramamoorthy and Li 77].
For these algorithms, the connection among various stages of the "pipe" is by and large
linear, although additional feedback links may sometimes be present. For example, the Cray
Research  CRAY-1  uses  6-stage  floating-point  adders  and  7-stage  floating-point  multipliers, 
and  the  CDC  STAR-100  uses  4-stage  floating-point  adders.  For  a  pipeline  floating-point 
adder,  the  pipe  typically  consists  of  stages  for  performing  exponent  alignment,  fraction  shift, 
fraction  addition,  and  normalization.  A  pipeline  arithmetic  unit  can  be  viewed  as  a  systolic 
machine  composed  of  linearly  connected  processors  that  are  capable  of  performing  a  set  of 
(different)  operations. 

The  pipeline  approach  is  ideal  for  situations  where  the  same  sequence  of  operations  will 
be  invoked  very  frequently,  so  that  the  start-up  time  to  initialize  and  fill  the  pipe  becomes 
relatively  insignificant.  This  is  the  case  when  the  machine  is  processing  long  vectors.  One  of 
the  main  concerns  in  using  pipeline  machines  such  as  the  CRAY-1  and  the  STAR-100  is  the 
average  length  of  the  vectors  to  be  processed  (see,  for  example,  [Voigt  77]). 
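
The dependence on vector length can be made concrete with an idealized timing model. The Python sketch below assumes that an s-stage pipe accepts one new operand pair per cycle and never stalls, so n results take s + n - 1 cycles; real machines have additional start-up overheads that this model ignores. The function names are hypothetical.

```python
def pipeline_cycles(stages, n):
    """Cycles for an idealized pipe with `stages` stages to produce n results."""
    return stages + n - 1 if n > 0 else 0

def pipeline_efficiency(stages, n):
    """Fraction of the peak rate (one result per cycle) achieved on n results."""
    return n / pipeline_cycles(stages, n)
```

Under this model a 6-stage adder reaches only about half of its peak rate on vectors of length 6, but more than 99 percent on vectors of length 1000.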

For integer arithmetic, bits in the input operands and carries generated by additions are
often pipelined (see, e.g., [Hallin and Flynn 72]). The following pipeline digit-adder using a
linear array is described in [Chen 75]. Suppose that we want to add two integer vectors (U_1,
U_2, ...) and (V_1, V_2, ...), and that U_i = u_{i1}u_{i2}...u_{ik} and V_i = v_{i1}v_{i2}...v_{ik} in their binary
representations. We illustrate how the adder works for k = 3 in Fig. 3-8. The u_{ij} and v_{ij}
march toward the processors synchronously as shown.


Figure 3-8: A pipeline integer adder

At each cycle, each processor sums the three numbers arriving from the three input lines
and then outputs the sum and the carry at the output lines. It is easy to check that with the
configuration shown, when the pair (u_{ij}, v_{ij}) reaches a processor, the carry needed to
produce the correct j-th digit in the result of U_i + V_i will also reach the same processor. As a
result, the pipelined adder can compute a sum U_i + V_i every cycle in the steady state.
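
The skewed arrival of digits and carries can be checked with a small simulation. The Python sketch below makes several assumptions that are not in the text: processor j handles bit j with j = 0 the least significant bit, pair i reaches processor j at cycle i + j, and the carry out of the last processor is dropped, so sums are taken modulo 2^k. The function name is hypothetical.

```python
def pipelined_add(us, vs, k):
    """Add k-bit integer pairs on a simulated linear array of k one-bit adders."""
    n = len(us)
    carry = [0] * k                      # carry[j]: carry latched by processor j last cycle
    sums = [0] * n
    for t in range(n + k - 1):           # the last pair leaves at cycle n + k - 2
        new_carry = carry[:]
        for j in range(k):
            i = t - j                    # index of the pair now at processor j
            if 0 <= i < n:
                cin = carry[j - 1] if j > 0 else 0
                total = ((us[i] >> j) & 1) + ((vs[i] >> j) & 1) + cin
                sums[i] |= (total & 1) << j
                new_carry[j] = total >> 1
        carry = new_carry                # all processors latch their outputs together
    return sums
```

The carry read by processor j at cycle t was produced by processor j - 1 at cycle t - 1, when it was working on the same pair, which is the property claimed above.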


3.3  Algorithms  Using  Two-Dimensional  Arrays 

We  restrict  ourselves  to  two-dimensional  communication  geometries  which  are  simple  and 
regular.  Consider  the  following  problem:  how  can  processors  be  distributed  in  a 
two-dimensional  area  so  that  they  can  be  mesh-connected  in  a  simple  and  regular  way,  in  the 
sense  that  the  connections  are  all  symmetric  and  of  the  same  length?  It  turns  out  that  there 
are  only  three  solutions  to  the  problem.  This  problem  is  related  to  that  of  finding  regular 
figures which can close pack to completely cover a two-dimensional area. The only three
regular figures which possess this property are the square, the hexagon and the equilateral
triangle  (see  Fig.  3-9).  In  the  following,  we  consider  algorithms  using  hexagonal  and  square 
arrays.  Interesting  algorithms  using  equilateral  triangular  arrays  are  yet  to  be  discovered. 


Figure  3-9:  The  three  types  of  regular  arrays: 

(a)  square  array,  (b)  hexagonal  array,  (c)  triangular  array. 


3.3.1 Algorithms Using Two-Dimensional Hexagonal Arrays

We demonstrate that two matrix algorithms, matrix multiplication and LU-decomposition, can
be  done  naturally  on  hexagonal  arrays.  The  basic  processor  used  by  these  two  algorithms  is 
the  inner  product  step  processor  (Fig.  3-10),  which  is  similar  to  the  ones  used  in  Section  3.2 
for FIR filtering and recurrence evaluations. The processor has three registers R_A, R_B, and
R_C, and has six external connections, three for input and three for output. In each unit time
interval, the processor shifts the data on its input lines denoted by A, B and C into R_A, R_B
and R_C, respectively, computes R_C ← R_C + R_A × R_B, and makes the input values for R_A and
R_B together with the new value of R_C available as outputs on the output lines denoted by A,
B and C, respectively. All outputs are latched and the logic is clocked so that when one
processor is connected to another, the changing output of one during a unit time interval will
not interfere with the input to another during this time interval. This is not the only
processing element we shall make use of, but it will be the work horse. A special processor
for computing reciprocals will be specified later when it is used. For details about these two
algorithms and other related results, see [Kung and Leiserson 78, Kung and Leiserson 79].
The hexagonal array connection is also natural for computing the transitive closure of a
Boolean matrix. In this case, the inner product step processor computes R_C ← R_C ∨ (R_A ∧ R_B).

Other examples of computations using hexagonal arrays include QR-factorization [Brent and
Kung 79a], relational database operations [Kung and Lehman 79a], and the tally circuit [Mead

Figure 3-10: The inner product step processor.

Figure 3-11: Band matrix multiplication.


It is easy to see that the matrix product C = (c_ij) of A = (a_ij) and B = (b_ij) can be
computed by the following recurrences:

c_ij^(1) = 0,
c_ij^(k+1) = c_ij^(k) + a_ik b_kj,
c_ij = c_ij^(n+1).

Let A and B be nxn band matrices of band width w_1 and w_2, respectively. We show how
the recurrences above can be evaluated by pipelining the a_ij, b_ij and c_ij through an array of
w_1 w_2 hex-connected inner product step processors. We illustrate the algorithm by
considering  the  band  matrix  multiplication  problem  in  Fig.  3-11.  The  diamond  shaped 
hexagonal  array  for  this  case  is  shown  in  Fig.  3-12,  where  arrows  indicate  the  directions  of 
data  flow. 

The elements in the bands of A, B and C march through the network in three directions
synchronously. Each c_ij is initialized to zero as it enters the network through the bottom
boundaries. (For the general problem of computing C = AB + D where D = (d_ij) is any given
matrix, c_ij should be initialized as d_ij.) One can easily see that with the inner product step
processors depicted in Fig. 3-10, each c_ij is able to accumulate all its terms before it leaves
the network through the upper boundaries. If A and B are nxn band matrices of band width
w_1 and w_2, respectively, then an array of w_1 w_2 hex-connected processors can pipeline the
matrix multiplication AxB in 3n + min(w_1, w_2) units of time. If A and B are nxn dense matrices
then 3n^2 - 3n + 1 hex-connected processors can compute AxB in 5(n-1) units of time. We
mention an important application of this result. It is well-known that an n^2-point discrete
Fourier transform (DFT) can be computed by first performing n independent n-point DFT's
and then using the results to perform another set of n independent n-point DFT's. The
computation of any of these two sets of n independent n-point DFT's is simply a matrix
multiplication AxB, where the (i,j) entry of matrix A is ω^{(i-1)(j-1)} and ω is a primitive n-th root
of unity. Hence, using O(n^2) hex-connected processors, an n^2-point DFT can be computed in
O(n) time.
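
The accumulation performed on each c_ij can be sketched sequentially. The Python fragment below assumes dense square matrices stored as lists of lists and applies the inner product step once per term; it makes no attempt to model the hexagonal data flow or its timing, and the function names are hypothetical.

```python
def inner_product_step(r_a, r_b, r_c):
    """One step of the processor: pass a and b through, update c."""
    return r_a, r_b, r_c + r_a * r_b

def matmul_by_inner_product_steps(A, B):
    """Compute C = AB, accumulating each c_ij one inner product step at a time."""
    n = len(A)
    C = [[0] * n for _ in range(n)]      # each c_ij enters the network as zero
    for i in range(n):
        for j in range(n):
            for k in range(n):
                _, _, C[i][j] = inner_product_step(A[i][k], B[k][j], C[i][j])
    return C
```

Initializing C with the entries of a matrix D instead of zeros gives AB + D, as noted in the text.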

The LU-Decomposition of a Matrix

The problem of factoring a matrix A into lower and upper triangular matrices L and U is
called LU-decomposition. Figure 3-13 illustrates the LU-decomposition of a band matrix with
p = 4 and q = 4. Once the L and U factors are known, it is relatively easy to invert A or to
solve the linear system Ax = b.

We  assume  that  matrix  A  has  the  property  that  its  LU-decomposition  can  be  done  by 
Gaussian  elimination  without  pivoting.  (This  is  true,  for  example,  when  A  is  a  symmetric 
positive-definite,  or  an  irreducible,  diagonally  dominant  matrix.)  The  triangular  matrices 
L = (l_ij) and U = (u_ij) are evaluated according to the following recurrences.

Figure 3-12: The hexagonal array for the matrix multiplication problem in Fig. 3-11

Figure 3-13: The LU-decomposition of a band matrix.

It turns out that the evaluation of these recurrences can be pipelined on a hexagonal array.
A global view of this pipelined computation is shown in Fig. 3-14 for the LU-decomposition
problem in Fig. 3-13. The array in Fig. 3-14 is constructed as follows. The processors below
the upper boundaries are the standard inner product step processors and are hex-connected
exactly the same as the matrix multiplication network presented above. The processor at the
top, denoted by a circle, is a special processor. It computes the reciprocal of its input and
pumps the result southwest, and also pumps the same input northward unchanged. The other
processors on the upper boundaries are again inner product step processors, but their
orientation is changed: the ones on the upper left boundary are rotated 120 degrees
clockwise; the ones on the upper right boundary are rotated 120 degrees counterclockwise.
The flow of data in the array is indicated by arrows in the figure.

Figure 3-14: The hexagonal array for pipelining the LU-decomposition
of the band matrix in Fig. 3-13.

If  A  is  an  nxn  band  matrix  with  band  width  w  =  p+q-1,  an  array  having  no  more  than  pq 
hex-connected  processors  can  compute  the  LU-decomposition  of  A  in  3n+min(p,q)  units  of 
time.  If  A  is  an  nxn  dense  matrix,  then  an  nxn  hexagonal  array  can  compute  the  L  and  U 
matrices  in  4n-2  units  of  time  which  includes  I/O  time.  The  remarkable  fact  that  the  matrix 
multiplication  network  forms  a  major  part  of  the  LU-decomposition  network  is  due  to  the 
similarity  of  the  defining  recurrences. 
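
The similarity to matrix multiplication can be seen in a sequential sketch of Gaussian elimination without pivoting. The Python fragment below is the textbook formulation, written here as an assumption rather than as the exact recurrences of the hexagonal array: each elimination step takes one reciprocal of the pivot and then updates the remaining entries by inner product steps. It assumes every pivot is nonzero, as the text requires; the function name is hypothetical.

```python
def lu_no_pivot(A):
    """Return (L, U) with A = LU, L unit lower triangular, U upper triangular."""
    n = len(A)
    a = [[float(x) for x in row] for row in A]
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for k in range(n):
        recip = 1.0 / a[k][k]            # the single reciprocal of step k
        for j in range(k, n):
            U[k][j] = a[k][j]
        L[k][k] = 1.0
        for i in range(k + 1, n):
            L[i][k] = a[i][k] * recip
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                a[i][j] += L[i][k] * (-U[k][j])   # inner product step
    return L, U
```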

Transitive  Closure 


Given a Boolean matrix A = (a_ij), the transitive closure of A can be computed from the
recurrence:

a_ij^(1) = a_ij,
a_ij^(k+1) = a_ij^(k) ∨ (a_ik^(k) ∧ a_kj^(k))


(see,  e.g.,  [Aho  et  al.  75]).  We  observe  that  this  recurrence  is  analogous  to  the  recurrence 
for  matrix  multiplication  or  LU-decomposition,  as  far  as  structures  for  subscripts  and 
superscripts  are  concerned.  This  suggests  that  we  use  hexagonal  arrays  to  solve  the 
transitive  closure  problem,  too.  Indeed,  an  efficient  transitive  closure  algorithm  using  the 
hexagonal  array  has  recently  been  discovered.  The  algorithm  differs  from  the  matrix 
multiplication  and  LU-decomposition  algorithm  in  that  it  computes  the  solution  in  two  passes 
rather  than  one  pass.  A  full  description  of  the  algorithm  will  appear  in  the  revised  version  of 
[Guibas  et  al.  79]. 
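
For reference, the recurrence can be evaluated sequentially by Warshall's algorithm. The Python sketch below is the standard uniprocessor method, not the two-pass hexagonal array algorithm mentioned above; it assumes the matrix is given as a list of lists of truth values, and the function name is hypothetical.

```python
def transitive_closure(A):
    """Warshall's algorithm on an n x n Boolean matrix."""
    n = len(A)
    a = [[bool(x) for x in row] for row in A]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Boolean counterpart of the inner product step
                a[i][j] = a[i][j] or (a[i][k] and a[k][j])
    return a
```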


3.3.2  Algorithms  Using  Two-dimensional  Square  Arrays 

The  square  array  is  perhaps  one  of  the  first  communication  geometries  studied  by 
researchers  who  were  interested  in  parallel  processing.  Work  in  cellular  automata,  which  is 
concerned  with  computations  distributed  in  a  two-dimensional  orthogonally  connected  array, 
was  initiated  by  von  Neumann  in  the  early  fifties  [Von  Neumann  66].  Theorists  in  cellular 
automata  have  been  traditionally  interested  in  the  "power"  of  a  cellular  automaton  system 
using,  say,  a  particular  number  of  states  at  each  cell.  More  recently,  because  of  the  advent 
of  LSI  technology,  there  has  been  an  increasing  interest  in  designing  algorithms  for  cellular 
arrays. Cellular algorithms for pattern recognition have been proposed in [Smith
71, Kosaraju 75, Foster and Kung 79], for graph problems in [Levitt and Kautz 72], for
switching in [Kautz et al. 68], for sorting in [Thompson and Kung 77], and for dynamic
programming  in  [Guibas  et  al.  79].  The  algorithms  for  dynamic  programming  in  [Guibas  et  al. 
79]  are  quite  special  in  that  they  involve  data  being  transmitted  at  two  different  speeds, 
which  give  the  effect  of  "time  reverse"  for  the  order  of  certain  results.  The  pattern  matching 
chip  described  in  [Foster  and  Kung  79]  has  recently  been  designed  and  fabricated. 

In  parallel  to  the  developments  of  cellular  algorithms  for  solving  combinatorial  problems, 
there  have  been  major  activities  in  using  the  array  structure  for  solving  large  numerical 
problems. Many of these activities are motivated or influenced by the ILLIAC IV computer,
which has an 8x8 processor array (see [Kuck 68]). Relaxation methods for solving partial
differential equations match the square array structure naturally. Typically, the variable u_ij
representing the solution at mesh point (i, j) is updated by a difference equation of the form:

u'_ij = F(u_{i+1,j}, u_{i-1,j}, u_{i,j+1}, u_{i,j-1}).

Hence, if u_ij is stored at processor (i, j) of the processor array, then each update (or
iteration)  involves  communications  only  among  neighboring  processors.  The  central  control 
provided  by  SIMD  machines  such  as  the  ILLIAC  IV  is  useful  for  broadcasting  relaxation  and 
termination  parameters,  which  are  often  needed  in  these  relaxation  methods.  Relaxation 
algorithms  on  two-dimensional  grids  are  also  used  in  image  processing,  for  which  mesh  points 
correspond  to  pixels  [Peleg  and  Rosenfeld  78]. 
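
One relaxation step can be sketched as follows. The Python fragment assumes a particular choice of F, the average of the four neighbors (the Jacobi iteration for Laplace's equation), and assumes fixed boundary values; both are illustrative assumptions, since the text leaves F unspecified. Every interior point is computed from the old grid only, which mirrors all processors updating simultaneously from their neighbors' current values.

```python
def relax_once(u):
    """Apply one synchronous update to every interior mesh point of grid u."""
    rows, cols = len(u), len(u[0])
    new = [row[:] for row in u]          # boundary values stay fixed
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Only the four nearest neighbors of (i, j) are read
            new[i][j] = 0.25 * (u[i + 1][j] + u[i - 1][j]
                                + u[i][j + 1] + u[i][j - 1])
    return new
```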


3.4  Algorithms  Using  Tree  Structures 

The  tree  structure,  shown  in  Fig.  3-15  (a),  has  the  nice  property  that  it  supports 
logarithmic-time  broadcast,  search,  and  fan-in.  Fig.  3-15  (b)  shows  an  interesting  "H"  shaped 
layout  of  a  binary  tree,  which  is  convenient  for  placement  on  a  chip. 

Unlike  the  array  structures  considered  earlier,  the  connections  in  the  tree  structure  are 
not  uniform.  The  distance  between  two  connecting  processors  increases  as  they  move  up  to 
the  root.  For  chip  implementation,  the  time  that  it  takes  a  signal  to  propagate  along  a  wire 
can  nevertheless  be  made  independent  of  the  length  of  the  wire,  by  fitting  larger  drivers  to 
longer  wires.  Thus,  by  using  appropriate  drivers  the  logarithmic  property  of  the  tree 
structure  can  still  be  maintained.  It  is  demonstrated  in  [Mead  and  Rem  79]  that  in  spite  of  the 
fact  that  large  drivers  take  large  areas,  with  the  layout  in  Fig.  3-15  (b)  it  is  possible  to 
implement  a  tree  using  a  total  chip  area  essentially  proportional  to  the  number  of  processors 
in  the  tree.  Moreover,  in  this  implementation  drive  currents  ramp  up  from  the  leaves  to  the 
root,  and  consequently,  off-chip  communication  can  be  conducted  at  the  root  without  serious 
delay.  In  the  following,  we  shall  assume  that  the  time  to  send  a  data  item  across  any  link  in 
the tree is constant, and that the root of the tree is the I/O node for outside world
communication.

Figure 3-15: (a) A binary tree structure, and (b) embedding a binary tree
in a two-dimensional grid.

The  logarithmic-time  property  for  broadcasting,  searching,  and  fan-in  is  the  main  advantage 
provided  by  the  tree  structure  that  is  not  shared  by  any  array  structure.  The  tree  structure, 
however,  has  the  following  possible  drawback.  Processors  at  high  levels  of  the  tree  may 
become  bottlenecks  if  the  majority  of  communications  are  not  confined  to  processors  at  low 
levels.  We  are  interested  in  algorithms  that  can  take  advantage  of  the  power  provided  by 
the  tree  structure  while  avoiding  its  drawback. 

Search  Algorithms 

The  tree  structure  is  ideal  for  searching.  Assume,  for  example,  that  information  stored  at 
the  leaves  of  a  tree  forms  the  data  base.  Then  we  can  answer  questions  of  the  following 
kinds  rapidly:  "What  is  the  nearest  neighbor  of  a  given  element?",  "What  is  the  rank  of  a 
given  element?",  "Does  a  given  element  belong  to  a  certain  subset  of  the  data  base?"  The 
paradigm  to  process  these  queries  consists  of  three  phases:  (i)  the  given  element  is 
broadcast  from  the  root  to  leaves,  (ii)  the  element  is  compared  to  some  relevant  data  at 
every  leaf  simultaneously,  and  (iii)  the  comparison  results  from  all  the  leaves  are  combined 
into  a  single  answer  at  the  root,  through  some  fan-in  process.  It  should  be  clear  that  using 
the  paradigm  and  assuming  appropriate  capabilities  of  the  processors,  queries  like  the  ones 
above can all be answered in logarithmic time. Furthermore, we note that when there are
many queries, it is possible to pipeline them on the tree. See [Bentley and Kung 79] for
discussions  of  using  the  tree-structured  machine  for  many  searching  problems. 
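
The three-phase paradigm can be sketched for the rank query. The Python fragment below is a sequential simulation under stated assumptions: the leaves hold one element each, the broadcast is modeled implicitly by giving every leaf the query, and the fan-in adds the counts of the two children at each level. The function name and the returned level count are illustrative only.

```python
def rank_query(leaves, x):
    """Return (rank, levels): the number of stored elements smaller than x,
    and the number of fan-in levels used to combine the leaf results."""
    if not leaves:
        return 0, 0
    # Phases (i) and (ii): x reaches every leaf, which compares it locally
    level = [1 if y < x else 0 for y in leaves]
    levels = 0
    # Phase (iii): fan-in, each parent adds the counts of its two children
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)              # pad so that every node has a sibling
        level = [level[p] + level[p + 1] for p in range(0, len(level), 2)]
        levels += 1
    return level[0], levels
```

For 8 leaves the fan-in takes 3 levels, in line with the logarithmic bound stated above.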

A  similar  idea  has  been  pointed  out  in  [Browning  79].  Algorithms  which  first  generate  a 
large  number  of  solution  candidates  and  then  select  from  among  them  the  true  solutions  can 
be  supported  by  the  tree  structure.  NP-complete  problems  [Karp  72]  such  as  the  clique 
problem  and  the  color  cost  problem  are  solvable  by  such  algorithms.  One  notes  immediately 
that  with  this  approach  an  exponential  number  of  processors  will  be  needed  to  solve  an 
NP-complete problem in polynomial time. However, with the emergence of very large scale
integration  (VLSI)  technology  this  brute  force  approach  may  gain  importance.  Here  we 
merely  wish  to  point  out  that  the  tree  structure  matches  the  structure  of  some  algorithms 
that  solve  NP-complete  problems. 

Systolic  Search  Tree 

As  one  is  thinking  about  applications  using  trees,  data  structures  such  as  search  trees  (see, 
e.g., [Aho et al. 75, Knuth 73]) will certainly come to mind. The problem is how to embed a
balanced search tree in a network of processors connected by a tree so that the O(log n)
performance for the INSERT, DELETE, and FIND operations can be maintained. The problem is
nontrivial  because  most  balancing  schemes  require  moving  pointers,  but  the  movement  of 
pointers  is  impossible  in  a  physical  tree  where  pointers  are  fixed  wires.  To  get  the  effect  of 
balancing  in  the  physical  tree,  data  rather  than  pointers  must  be  moved.  Common  balanced 
tree  schemes  such  as  AVL  trees  and  2-3  trees  do  not  map  well  onto  the  tree  network 
because  data  movements  involved  in  balancing  are  highly  non-local.  A  new  organization  of  a 
hardware  search  tree,  called  a  systolic  search  tree,  was  recently  proposed  by  [Leiserson  79], 
on  which  the  data  movements  for  balancing  are  always  local  so  that  the  desired  O(log  n) 
performance  can  be  achieved.  In  Leiserson’s  paper  an  application  of  using  the  systolic 
search  tree  as  a  common  storage  for  a  collection  of  disjoint  priority  queues  is  discussed. 

Evaluation  of  Arithmetic  Expressions  and  Recurrences 

Another application of the tree structure is its use for evaluating arithmetic expressions.
Any expression of n variables can be evaluated by a tree of at most 4⌈log_2 n⌉ levels [Brent
74], but the time to input the n variables to the tree from the root is still O(n). This input
time can often be overlapped with the computation time in the case of recurrence evaluation
(see [Kung 79]).

3.5 Algorithms Using Shuffle-Exchange Networks

Consider a network having n = 2^m nodes, where m is an integer. Assume that nodes are
named as 0, 1, ..., 2^m - 1. Let i_m i_{m-1} ... i_1 denote the binary representation of any integer i,
0 <= i <= 2^m - 1. The shuffle function is defined by

S(i_m i_{m-1} ... i_1) = i_{m-1} i_{m-2} ... i_1 i_m,

and the exchange function is defined by

E(i_m i_{m-1} ... i_1) = i_m i_{m-1} ... i_2 \bar{i}_1,

where \bar{i}_1 denotes the complement of i_1.

The network is called a shuffle-exchange network if node i is connected to node S(i) for all i,
and to node E(i) for all even i. It is often convenient to view each pair of nodes connected
by the exchange function as a 2x2 processor which has two input ports and two output
ports. Fig. 3-16 illustrates the shuffle function for the case when n = 2^m = 8.

Figure 3-16: The shuffle function and 2x2 processors for the case when n = 8.


Observe that for i = 1, ..., m, by executing the shuffle function i times, data originally at two
nodes whose names differ by 2^{m-i} can be brought to the same 2x2 processor. This type of
communication happens to be natural to a number of algorithms. It was shown by [Batcher
68] that the bitonic sort of n elements can be carried out in O(log^2 n) steps on the
shuffle-exchange network when the 2x2 processors are capable of performing
comparison-exchange operations. It was shown by [Pease 68] that the n-point fast Fourier
transform (FFT) can be done in O(log n) steps on the network when the 2x2 processors are
capable of doing addition and multiplication operations. Other applications including matrix
transposition and linear recurrence evaluation are given in [Stone 71, Stone 75]. The two
articles by Stone give clear expositions for all these algorithms and have good discussions on
the basic idea behind them. Here we illustrate the use of the network for performing the
8-point FFT. The computation has three stages. Stage i, i = 1, 2, 3, involves combining data
from two nodes whose names differ by 2^{3-i}. This is indicated by the graph in Fig. 3-17 (a).
A topologically equivalent graph is shown in Fig. 3-17 (b). The latter graph demonstrates the
fact that the computation at each stage can be done entirely inside the 2x2 processors,
provided that results from the previous stage have been "shuffled". Note that the same
shuffle network can be used for shuffling inputs for all the stages if so desired.
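
The pairing property of repeated shuffles can be checked directly. The Python sketch below represents node names as m-bit integers, implements the shuffle as a one-bit left rotation and the exchange as complementing the last bit, and tests whether two data items end up on nodes joined by an exchange link. The function names are hypothetical.

```python
def shuffle(i, m):
    """S(i): rotate the m-bit name of node i left by one position."""
    top = (i >> (m - 1)) & 1
    return ((i << 1) & ((1 << m) - 1)) | top

def exchange(i):
    """E(i): complement the least significant bit of the name of node i."""
    return i ^ 1

def meet_after_shuffles(x, y, m, count):
    """True if data starting at nodes x and y share a 2x2 processor
    after `count` applications of the shuffle."""
    for _ in range(count):
        x, y = shuffle(x, m), shuffle(y, m)
    return exchange(x) == y
```

For m = 3, one shuffle pairs nodes whose names differ by 4, two shuffles pair names differing by 2, and three shuffles pair names differing by 1, matching the three stages of the 8-point FFT.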


Figure  3-17:  (a)  The  communication  structure  of  the  8-point  FFT,  and 
(b)  its  realization  by  the  shuffle-exchange  network. 


Many  powerful  rearrangeable  permutation  networks,  such  as  those  in  [Benes  65]  which  are 
capable of performing all possible permutations in O(log n) delays, can be viewed as
multi-stage shuffle-exchange networks (see, e.g., [Kuck 78]). The shuffle-exchange network,
perhaps due to its great power in permutation, suffers from the drawback that it has a very
low degree of regularity and modularity. Indeed, it was recently shown by [Thompson 79a]
that the network is not planar and cannot be embedded in silicon using area linearly
proportional to the number of nodes in the network.

3.6 Remarks for Section 3

For  a  fixed  problem,  it  is  often  possible  to  design  algorithms  using  different  communication 
topologies.  A  good  example  of  this  is  the  sorting  problem.  A  performance  hierarchy  for 
sorting  n  elements  on  n  processors  connected  by  various  networks  is  given  in  Fig.  3-18 
[Knuth  73,  Thompson  and  Kung  77,  Batcher  68]. 


NETWORK             SORTING TIME

1-dim array         O(n)
2-dim array         O(n^{1/2})
k-dim array         O(n^{1/k})
Shuffle-Exchange    O(log^2 n)

Figure 3-18: Sorting times on various networks


Each  of  these  algorithms  can  be  useful  under  appropriate  circumstances.  For  a  discussion 
on  the  related  problem  of  mapping  a  given  algorithm  (rather  than  a  given  problem)  on 
different  networks,  see  [Kung  and  Stevenson  77]. 

Sorting  can  also  be  done  on  the  tree  structure  in  0(n)  time  in  a  straightforward  way.  But 
the same performance is achievable by the simpler one-dimensional linear array using the
priority queue approach (cf. Section 3.1). For this reason, we did not include tree sort as
one  of  the  algorithms  for  the  tree  network  in  Section  3.4.  The  general  guideline  we  have 
been  using  in  this  section  for  choosing  algorithms  under  a  given  communication  structure  is  as 
follows:  An  algorithm  is  included  only  if  it  uses  the  structure  effectively,  in  the  sense  that 
the  same  performance  does  not  seem  to  be  possible  on  a  simpler  structure.  One  should  note, 
however,  that  sometimes  it  may  be  worthwhile  to  consider  solving  a  problem  on  some 
network  which  is  not  inherently  best  suited  for  the  problem.  For  instance,  at  an  installation  a 
fixed  network  may  have  to  be  used  for  solving  a  set  of  rather  incompatible  problems. 

Up  to  this  point,  we  have  been  considering  almost  exclusively  the  case  when  there  are 
enough  processors  for  the  problem  one  wants  to  solve.  The  only  exception  is  that  for  the 
odd-even transposition sort we discussed how to sort n elements by k processors where k < n, and
concluded that a near-optimal speed up ratio can be achieved if k << n. In general, there are
three  approaches  one  can  take  for  solving  a  large  problem  on  a  small  network. 

i. Use algorithms with large module granularity. Each processor handles a large
group of elements rather than a few elements. For the odd-even transposition
sort mentioned above, a subsequence consisting of n/k elements is stored in
each processor. For matrix problems, a row, a column, or a submatrix may be
stored in a processor. This approach is suitable for SIMD machines where
processors can have relatively large local memories. In this case, one must
carefully design the global data structure, which is now distributed over the local
memories, so as to ensure that needed memory accesses can be performed in
parallel without conflicts [Lawrie 75, Kuck 73].

ii.  Decompose  the  problem.  The  idea  is  that  after  decomposition  each  subproblem 
will  be  small  enough  so  that  it  can  be  solved  on  the  given  small  network  of 
processors.  A  matrix  multiplication  involving  large  matrices,  for  example,  can  be 
done  on  a  small  network  by  performing  a  sequence  of  matrix  multiplications 
involving  submatrices. 

iii. Decompose an algorithm that originally requires a large network. Simultaneous
operations invoked in one step of the original algorithm are now carried out in a
number of steps by the small network. With this approach, the LU-decomposition
algorithm for an nxn matrix in Section 3.3.1 can be performed on a kxk hexagonal
array in O(n^3/k^2) time, when n is large and k is fixed.
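
Approach (ii) can be sketched for matrix multiplication. The Python fragment below assumes that the small network can multiply submatrices of size at most k x k and accumulate the result; each pass through the innermost three loops stands for one such subproblem. It is a sequential illustration only, and the function name is hypothetical.

```python
def blocked_matmul(A, B, k):
    """Multiply n x n matrices as a sequence of products of k x k submatrices."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for bi in range(0, n, k):
        for bj in range(0, n, k):
            for bk in range(0, n, k):
                # One subproblem: C[bi.., bj..] += A[bi.., bk..] * B[bk.., bj..]
                for i in range(bi, min(bi + k, n)):
                    for j in range(bj, min(bj + k, n)):
                        for p in range(bk, min(bk + k, n)):
                            C[i][j] += A[i][p] * B[p][j]
    return C
```

When k divides n, the large product is carried out as (n/k)^3 submatrix products.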

Using one of the three approaches, one should be able, in principle, to design algorithms for
small networks to solve large problems.

4. Algorithms for Asynchronous Multiprocessors

4.1  Introduction 

In this section we consider parallel algorithms for an asynchronous MIMD multiprocessor
like  C.mmp  or  Cm*,  which  is  composed  of  a  number  of  independent  processors  sharing  the 
primary  memory  by  means  of  a  switch  or  connecting  network.  Such  an  algorithm  will  be 
viewed  as  a  collection  of  cooperating  processes  that  may  execute  simultaneously  in  solving  a 
given  problem.  It  is  important  to  distinguish  between  the  notion  of  process,  which 
corresponds  to  the  execution  of  a  program,  and  the  notion  of  processor,  which  is  a  functional 
unit  by  which  a  process  can  be  carried  out.  At  the  decision  of  the  operating  system,  the 
same  process  may  be  executed  by  any  processor  at  a  given  time. 

In the design and analysis of parallel algorithms for asynchronous multiprocessors, one
should assume that the time required to execute the steps of a process is unpredictable
[Kung 76]. Based on measurements obtained from C.mmp, six major sources of
fluctuations in execution times have been identified [Oleinick 78]. The six sources include
variations  in  computation  time  due  to  different  instances  of  inputs,  memory  contention, 
operating  system’s  scheduling  policies,  variations  in  the  individual  processor  speeds,  etc.  This 
asynchronous  behavior  leads  to  serious  issues  regarding  the  correctness  and  efficiency  of  an 
algorithm.  The  correctness  issue  arises  because  during  the  execution  of  an  algorithm 
operations  from  different  processes  may  interleave  in  an  unpredictable  manner.  The 
efficiency  issue  arises  because  any  synchronization  introduced  for  correctness  reasons  takes 
extra  time  and  also  reduces  concurrency.  In  the  following,  we  shall  examine  various 
techniques  for  dealing  with  the  correctness  and  efficiency  issues  that  are  encountered  in  the 
use  of  asynchronous  multiprocessors. 

Asynchronous  multiprocessors  can  support  truly  concurrent  database  systems,  where 
simultaneous  access  to  a  database  by  more  than  one  process  is  possible.  Recent  research 
results concerning the integrity of multi-user database systems are directly applicable to
concurrent  database  systems.  Some  of  these  results  will  be  examined  in  Section  4.2.  A 
concurrent  database  system  can  be  viewed  as  an  asynchronous  algorithm  consisting  of 
processes  that  execute  so-called  transactions.  In  designing  a  general  database  system,  one 
usually  has  little  control  over  the  set  of  transactions  that  will  be  allowed  to  run  in  the 
system.  However,  in  designing  an  algorithm  to  solve  a  fixed  problem,  one  does  have  control 
over  the  tasks  to  be  included  in  the  algorithm.  As  a  result,  it  is  often  possible  to  design 
parallel  algorithms  without  costly  synchronizations  for  solving  specific  problems.  We  shall 
consider  several  of  these  highly  efficient  algorithms  in  Section  4.3.  Finally,  in  Section  4.4,  we 
shall discuss some of the guidelines for designing efficient algorithms for asynchronous
multiprocessors.


4.2  Concurrent  Database  Systems 

In  a  concurrent  database  system,  a  number  of  on-line  transactions  are  allowed  to  run 
concurrently  on  a  shared  database.  One  of  the  important  issues  arising  from  the  concurrent 
execution  of  transactions  is  the  consistency  problem.  A  database  is  said  to  be  consistent  if 
all  integrity  constraints  defined  for  the  data  are  met.  A  transaction  is  said  to  be  correct  if 
starting  from  a  consistent  state  the  execution  of  the  transaction  will  terminate  and  preserve 
consistency.  A  concurrent  execution  of  several  correct  transactions  may,  however,  transform 
a  consistent  database  into  an  inconsistent  one!  We  illustrate  this  fact  with  a  simple  example. 
Suppose that the integrity constraint is x > 0. Then the transaction, if x > 1 then x ← x - 1,
is correct, but the concurrent execution of two such transactions may transform a consistent
state, x = 2, into an inconsistent state, x = 0. The mechanism in a concurrent database
system that safeguards database consistency is usually called a "concurrency control". (In
Section 2, we have used the same term with a more general meaning.)
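The example can be checked by writing out the steps explicitly. The following Python sketch assumes, for illustration only, that each transaction consists of two indivisible steps, a test of x > 1 and a decrement of x; the names `run`, `serial` and `interleaved` are made up for the example.

```python
def run(schedule, x=2):
    """Execute two copies of 'if x > 1 then x <- x - 1' step by step.

    schedule is a list of (transaction, step) pairs, where step is either
    "test" or "dec". A transaction decrements only if its own test passed.
    """
    passed = {}
    for txn, step in schedule:
        if step == "test":
            passed[txn] = x > 1
        elif step == "dec" and passed.get(txn):
            x = x - 1
    return x


# One transaction at a time: the second test fails, so x stays positive.
serial = [(1, "test"), (1, "dec"), (2, "test"), (2, "dec")]
# Both tests run before either decrement: both pass, and x drops to 0.
interleaved = [(1, "test"), (2, "test"), (1, "dec"), (2, "dec")]

x_serial = run(serial)
x_interleaved = run(interleaved)
```

Under these assumptions the serial order leaves x = 1, while the interleaved order violates the constraint x > 0.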

There  have  been  two  major  approaches  in  contending  with  the  consistency  problem.  The 
first  approach,  discussed  in  Section  4.2.1  below,  is  the  "serialization  approach",  which 
requires no knowledge of the integrity constraints, but does require syntactic information
about  the  transactions.  The  second  approach,  considered  in  Section  4.2.2,  will  use  specific 
knowledge  of  the  integrity  constraints  to  construct  correct  and  hopefully  more  efficient 
concurrent  database  systems.  In  [Kung  and  Papadimitriou  79]  maximum  degrees  of 
concurrency  are  proved  to  depend  upon  the  types  of  knowledge  that  are  available. 

Besides  the  consistency  issue,  there  are  a  number  of  other  important  issues  concerning 
concurrent database systems. Among them is the recovery problem. Solutions to the
recovery problem are often closely related to solutions to the consistency problem. The
recovery  problem  will  not  be  explicitly  treated  in  this  paper.  The  reader  is  referred  to  [Gray 
78]  for  a  good  discussion  of  recovery. 


4.2.1  The  Serialization  Approach 

Throughout  our  discussion,  transactions  are  assumed  to  be  correct  in  the  sense  that  they 
preserve  database  consistency  when  executed  alone.  Serial  execution  of  a  set  of 
transactions  is  one-transaction-at-a-time  execution.  It  preserves  consistency,  since  the 
execution  of  each  transaction  does  so  (see  Fig.  4-1). 

   CONSISTENT    T1    CONSISTENT    T2           Tn    CONSISTENT
     STATE     ----->    STATE     ----->  ...  ----->    STATE

Figure 4-1: A serial execution of correct transactions, T1, T2, ..., Tn, which
preserves consistency.

The serialization approach makes sure that a concurrent execution has the same overall
effect as some serial execution and therefore preserves consistency. This approach is very
general in the sense that it applies to any concurrent database system and requires no
information on the semantics of the transactions and integrity constraints. In fact, it has been
shown in [Kung and Papadimitriou 79] that serialization is the weakest criterion for
preserving consistency if only syntactic information can be used.


4.2.1.1 The Two-Phase Transaction Method

In [Eswaran et al. 76] a serialization method is proposed in which each transaction employs
a locking protocol to ensure that it "sees" only a consistent state of the database. Here we
briefly  describe  their  scheme.  It  is  assumed  that  a  transaction  must  have  a  share  lock  or 
exclusive  lock  on  any  entity  it  is  reading,  and  an  exclusive  lock  on  any  entity  it  is  writing. 
Fig.  4-2  shows  the  compatibility  among  the  lock  modes. 


              SHARE     EXCLUSIVE

SHARE          YES         NO

EXCLUSIVE      NO          NO

Figure 4-2: Compatibilities among lock modes.

A  transaction  is  a  two-phase  transaction  if  it  does  not  request  new  locks  after  releasing  a 
lock. Hence a two-phase transaction consists of a growing phase during which it requests
locks,  and  a  shrinking  phase  during  which  it  releases  locks.  A  schedule  of  a  set  of  concurrent 
transactions  is  a  history  of  the  order  in  which  statements  in  the  transactions  are  executed.  A 
schedule  can  be  totally  ordered  or  partially  ordered.  The  latter  case  corresponds  to  the 
multiprocessor  environment  where  a  set  of  statements  from  different  transactions  can  be 
executed  simultaneously  by  a  number  of  processors.  A  serial  schedule  is  a  schedule 
corresponding  to  a  serial  execution  of  the  transactions.  A  schedule  is  legal  if  it  does  not 
schedule  a  lock  action  on  an  entity  for  one  transaction  when  that  entity  is  already  locked  by 
some  other  transaction  in  a  conflicting  mode.  In  Fig.  4-3  we  illustrate  a  possible  legal 
schedule of two-phase transactions T1 and T2.
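The definitions of "two-phase" and "legal" can be made concrete with a small checker. The Python sketch below is only an illustration: the encoding of a schedule as (transaction, action, entity) triples, the function names, and the assignment of the steps of Fig. 4-3 to T1 and T2 are our assumptions.

```python
# Lock compatibility as in Fig. 4-2: only SHARE with SHARE is compatible.
COMPATIBLE = {
    ("share", "share"): True,
    ("share", "exclusive"): False,
    ("exclusive", "share"): False,
    ("exclusive", "exclusive"): False,
}
LOCKS = ("share", "exclusive")


def is_two_phase(schedule, txn):
    """True if txn requests no new lock after it has released a lock."""
    released = False
    for t, action, _entity in schedule:
        if t != txn:
            continue
        if action == "unlock":
            released = True
        elif action in LOCKS and released:
            return False
    return True


def is_legal(schedule):
    """True if no lock is granted while another transaction holds the
    same entity in a conflicting mode."""
    held = {}  # entity -> {transaction: lock mode}
    for t, action, entity in schedule:
        owners = held.setdefault(entity, {})
        if action in LOCKS:
            for other, mode in owners.items():
                if other != t and not COMPATIBLE[(mode, action)]:
                    return False
            owners[t] = action
        elif action == "unlock":
            owners.pop(t, None)
    return True


# Assumed reading of the schedule in Fig. 4-3 (steps 1 to 14).
FIG_4_3 = [
    ("T1", "exclusive", "x"), ("T1", "exclusive", "y"),
    ("T1", "read", "x"), ("T1", "read", "y"),
    ("T1", "write", "x"), ("T1", "unlock", "x"),
    ("T2", "exclusive", "x"), ("T2", "write", "x"),
    ("T1", "write", "y"), ("T1", "unlock", "y"),
    ("T2", "share", "y"), ("T2", "unlock", "x"),
    ("T2", "read", "y"), ("T2", "unlock", "y"),
]
```

With this encoding, `is_legal(FIG_4_3)` and `is_two_phase(FIG_4_3, "T1")` both hold, whereas moving T2's exclusive lock on x ahead of T1's unlock of x makes the schedule illegal.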

          T1                          T2

 1.  exclusive lock x
 2.  exclusive lock y
 3.  read x
 4.  read y
 5.  write x
 6.  unlock x
 7.                              exclusive lock x  \
 8.                              write x           /  (B)
 9.  write y     \
10.  unlock y    /  (A)
11.                              share lock y
12.                              unlock x
13.                              read y
14.                              unlock y

Figure 4-3: A legal schedule of two-phase transactions T1 and T2.

The numbering on the left-hand side specifies the execution order of the schedule (so the
schedule is totally ordered). Note that actions in (A) are independent from actions in (B) in
the  sense  that  (A)  and  (B)  involve  disjoint  variables.  Thus,  the  input  and  output  of  any  action 
in  (A)  or  (B)  is  unchanged  if  actions  in  (A)  precede  actions  in  (B)  instead.  This  implies  that  the 
effect of the schedule is the same as that of the serial schedule that executes T1 first and
then T2. Theorem 4-1 below asserts that the same phenomenon holds for any legal schedule
of two-phase transactions. To understand the theorem, we need to introduce some additional
terminology. The dependency graph of a schedule is a directed graph whose nodes are
transaction names and whose arcs, which are labeled, indicate how transactions depend on
each other. More precisely, an arc from Ti to Tj exists if and only if during the execution Tj
reads an entity Ti has written or Tj writes an entity Ti has read or written, and the label of
the arc in this case is the name of the entity. The dependency graph of a schedule
completely  determines  the  state  of  the  database  each  transaction  "sees"  when  transactions 
are  executed  according  to  the  schedule.  We  say  two  schedules  are  equivalent  if  they  have 
the  same  dependency  graph.  We  state  the  main  theorem  regarding  the  two-phase  transaction 
method: 

Theorem  4-1:  Any  legal  schedule  of  two-phase  transactions  is  equivalent  to  a 
serial  schedule. 
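To make the notion of a dependency graph concrete, the following Python sketch builds the labeled arcs of a totally ordered schedule and looks for a serial order consistent with them. It is an illustration under our own assumptions: the (transaction, action, entity) encoding, the function names, and the reading of Fig. 4-3 are not taken from [Eswaran et al. 76], and `graphlib` requires Python 3.9 or later.

```python
from graphlib import CycleError, TopologicalSorter


def dependency_graph(schedule):
    """Return arcs (Ti, Tj, entity): Tj reads an entity Ti has written,
    or Tj writes an entity Ti has read or written."""
    arcs = set()
    readers, writers = {}, {}  # entity -> transactions seen so far
    for t, action, e in schedule:
        if action == "read":
            arcs |= {(u, t, e) for u in writers.get(e, set()) if u != t}
            readers.setdefault(e, set()).add(t)
        elif action == "write":
            earlier = readers.get(e, set()) | writers.get(e, set())
            arcs |= {(u, t, e) for u in earlier if u != t}
            writers.setdefault(e, set()).add(t)
    return arcs


def equivalent_serial_order(schedule):
    """Return a serial order consistent with every arc, or None if the
    dependency graph contains a cycle."""
    preds = {}
    for t, _action, _e in schedule:
        preds.setdefault(t, set())
    for u, v, _e in dependency_graph(schedule):
        preds[v].add(u)
    try:
        return list(TopologicalSorter(preds).static_order())
    except CycleError:
        return None


# Assumed reading of Fig. 4-3; lock and unlock actions are ignored here.
FIG_4_3 = [
    ("T1", "exclusive", "x"), ("T1", "exclusive", "y"),
    ("T1", "read", "x"), ("T1", "read", "y"),
    ("T1", "write", "x"), ("T1", "unlock", "x"),
    ("T2", "exclusive", "x"), ("T2", "write", "x"),
    ("T1", "write", "y"), ("T1", "unlock", "y"),
    ("T2", "share", "y"), ("T2", "unlock", "x"),
    ("T2", "read", "y"), ("T2", "unlock", "y"),
]
# A schedule without locking whose dependency graph has a cycle.
CYCLIC = [("T1", "read", "x"), ("T2", "write", "x"),
          ("T2", "write", "y"), ("T1", "read", "y")]
```

For the schedule of Fig. 4-3 all arcs go from T1 to T2, so the serial order T1, T2 has the same dependency graph; for `CYCLIC` no serial order exists.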

The theorem implies that at the termination of any concurrent execution of two-phase
transactions the consistency of the database is maintained. A concurrent execution of
two-phase transactions may lead to a deadlock, however. In this case, after the deadlock is
detected, any transaction on the deadlock cycle can be backed up. Because all the
transactions are two-phase locked, it is guaranteed (why?) that backing up a transaction for
breaking a deadlock will neither cause other transactions to lose updates, nor require backing
up other transactions.

Much  insight  into  locking  can  be  gained  by  a  simple  geometric  method  [Kung  and 
Papadimitriou 79]. Consider the concurrent execution of two transactions T1 and T2. Any
state of progress towards the completion of T1 and T2 can be viewed as a point in the
two-dimensional "progress space", as shown in Fig. 4-4.


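[Figure: progress space with axes T1 and T2, origin O and final point F, two forbidden blocks, and deadlock region D.]

Figure 4-4: The "progress space" for transactions T1 and T2.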

A schedule of T1 and T2 corresponds to a nondecreasing curve, called a progress curve,
from the origin to point F. The progress curves lying on the two boundaries, OT2F and OT1F,
represent the two serial schedules. Locking has the effect of imposing restrictions in the
form of forbidden rectangular regions (see the two blocks in Fig. 4-4). It is easy to see that
a schedule is legal if and only if its progress curve avoids all blocks. Region D in the figure is
a deadlock region, in the sense that any progress curve trapped in the region will not be able
to reach F. The important observation is that two schedules are equivalent if and only if
their progress curves are not separated by any block. Consequently, if all the blocks are
connected as in Fig. 4-4, then any legal schedule (which avoids blocks) must be equivalent to
some serial schedule. The idea of two-phase transactions is now extremely easy to explain.
It simply keeps all blocks connected by letting them have a point u in common. The
coordinates u1, u2 of u are the phase-shift points, at which all locks have been granted, and
none have been released.


4.2.1.2 Validation Methods - An Optimistic Approach


Validation methods represent another general approach for achieving serialization [Kung
and Robinson 79]. The methods rely on transaction backup rather than locking as a control
mechanism. The methods are "optimistic" in the sense that they "hope" that conflicts between
transactions will not occur and thus transaction backup will not be necessary. The idea
behind this optimistic approach is quite simple, and may be summarized as follows:

- Since reading a value or a pointer from a node can never cause a loss of
integrity, reads are completely unrestricted (however, returning a result from a
query is considered to be equivalent to a write, and so is subject to validation as
discussed below).

- Writes are severely restricted. It is required that any transaction consist of two
or three phases: a read phase, a validation phase, and a possible write phase
(see Fig. 4-5). During the read phase, all writes take place on local copies of the
nodes to be modified. Then, if it can be established during the validation phase
that the changes the transaction made will not cause a loss of integrity, the local
copies are made global in the write phase. In the case of a query, it must be
determined that the result the query would return is actually correct. The step
in which it is determined that the transaction will not cause a loss of integrity (or
that it will return the correct result) is called validation.


    |------ read ------|-- validation --|-- write --|

Figure 4-5: The three phases of a transaction T.

We use Fig. 4-6 to illustrate how validation works. Suppose that transaction T2 completes
its write phase by time t1. At time t2 transaction T1 finishes its read phase and starts its
validation. If as far as the writes of T2 are concerned, T1 can be thought of as if it started
after T2 had been validated, then T1 will be validated in our scheme. This is the case when
the write set of T2 (the set of variables T2 writes) and the read set of T1 (the set of
variables T1 reads) are disjoint. Assume that T1 is successfully validated at time t3, and T3
finishes its read phase at time t4. For validating T3, the set of variables T3 reads and the
set of variables T1 and T2 write have to be compared. If the two sets are disjoint then T3
can be validated. Suppose that T3 is validated at time t5. Then we see that the schedule
corresponding to the concurrent execution of T1, T2 and T3 in Fig. 4-6 is equivalent to the
serial schedule which executes T2 first and then T1 and then T3. This illustrates how the
validation method enforces serialization.

Figure 4-6: Three concurrent transactions.


A straightforward implementation of the validation method is as follows. A set W is kept,
which contains the write sets along with the validation completion times of all validated
transactions. For validating a transaction T that has just completed its read phase, the following
steps are involved:

1. Compare the read set of T with the write sets of those transactions which were
successfully validated between the start time and finish time of T.

2. If the read set is disjoint from every write set examined in step (1)
above, then do the following; otherwise restart T.

    i. Lock W in exclusive mode.

    ii. Compare the read set of T with the write sets of those transactions which
    have been successfully validated since the time T completed its read
    phase.

    iii. If the read set is disjoint from every write set examined in step (ii)
    above, then validate T by performing the following operations; otherwise
    unlock W and restart T.

        a. Insert the write set of T along with the current time as the
        validation completion time of T into set W.

        b. Make the local copies of T, which contain all the writes of T, global.

        c. Unlock W.

The set W can be pruned by deleting information concerning validated transactions
whose validation completion times are earlier than the start times of all currently active
transactions. When several transactions are ready to be validated, the main comparison step
(step (1)) for one transaction can be carried out in parallel with the main comparison steps
for other transactions on a multiprocessor. It is possible to optimize the implementation
outlined above in a number of ways. We will not elaborate on them here.
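The validation steps listed above can be sketched in a few lines of Python. This is a simplified illustration and not the implementation of [Kung and Robinson 79]: the class and method names are hypothetical, time is modeled by a logical counter of validated transactions, the database is a dictionary, and pruning of W is omitted.

```python
import threading


class Transaction:
    def __init__(self, start):
        self.start = start     # logical time at which the read phase began
        self.read_set = set()
        self.writes = {}       # local copies: variable -> new value


class OptimisticDB:
    def __init__(self):
        self.data = {}
        self.W = []            # (validation completion time, write set)
        self.clock = 0         # logical time: number of validated transactions
        self.w_lock = threading.Lock()

    def begin(self):
        return Transaction(start=self.clock)

    def read(self, txn, key):
        txn.read_set.add(key)
        return txn.writes.get(key, self.data.get(key))

    def write(self, txn, key, value):
        txn.writes[key] = value          # read phase: local copy only

    def _conflict(self, txn, after, upto=None):
        """True if a write set validated in (after, upto] meets the read set."""
        for tn, wset in list(self.W):
            if tn > after and (upto is None or tn <= upto):
                if wset & txn.read_set:
                    return True
        return False

    def commit(self, txn):
        """Validate txn; return False if the caller must restart it."""
        finish = self.clock                          # end of the read phase
        if self._conflict(txn, txn.start, finish):   # step 1, W not locked
            return False
        with self.w_lock:                            # step 2.i
            if self._conflict(txn, finish):          # steps 2.ii and 2.iii
                return False
            self.clock += 1
            self.W.append((self.clock, set(txn.writes)))   # step 2.iii.a
            self.data.update(txn.writes)                   # step 2.iii.b
        return True                                  # W unlocked on exit (2.iii.c)
```

A transaction whose read set meets the write set of a transaction validated after it started is rejected and must be restarted by the caller; step 1 runs without holding the lock on W, which is what allows it to proceed in parallel for several transactions.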

Validation  methods  are  superior  to  locking  methods  for  systems  where  transaction  conflict 
is  unlikely.  Examples  include  query  dominant  systems  and  very  large  tree  structured  indexes. 
For  these  cases,  a  validation  method  will  avoid  locking  overhead,  and  may  take  full  advantage 
of  a  multiprocessor  environment  in  the  validation  phase  using  the  parallel  validation  technique 
presented. Techniques are still needed for determining the instances where an optimistic
approach is better than a locking approach. See [Kung and Robinson 79] for further
discussion of this serialization method, which is not based on locking.


4.2.1.3 Remarks

Serialization methods somewhat similar to validation methods are considered in [Stearns et
al. 76] for both centralized and distributed database systems. It is pointed out there that if
the  ordering  of  the  equivalent  serial  schedule  is  determined  on-the-fly  as  requests  are 
processed,  then  a  situation  similar  to  deadlock  may  occur.  The  situation  is  called  "cyclic 
restart",  in  which  a  finite  set  of  transactions  are  caught  in  a  loop  of  continually  aborting  and 
restarting  each  other.  They  solve  the  problem  by  using  a  preassigned  ordering  of 
transactions. The method outlined in Section 4.2.1.2, on the other hand, uses validation
completion  times  to  determine  the  ordering  of  transactions.  Though  the  ordering  is  dynamic, 
the  method  is  not  subject  to  cyclic  restart  because  in  this  method  only  validated  transactions 
can  restart  other  transactions.  C.  Papadimitriou  considers  the  general  problem  of  determining 
whether  a  given  sequence  of  read  and  write  operations  corresponding  to  requests  from 
several  transactions  is  serializable  [Papadimitriou  78].  He  proves  that  the  problem  is 
NP-complete. Thus it is unlikely that there exist efficient schedulers which will recognize all
serializable sequences of requests by the transactions. For discussions of serialization
methods for distributed database systems, see, for example, [Bernstein et al. 78, Rosenkrantz
et al. 78, Stonebraker 78].


4.2.2 The Approach Using Semantic Information - A Non-Serialization Approach

As mentioned earlier, the serialization approach requires no semantic information about
transactions  and  integrity  constraints,  and  if  such  semantic  information  is  not  used, 
serialization  is  actually  the  only  approach  one  can  take  for  solving  the  consistency  problem  in 
a  concurrent  system.  However,  if  the  meanings  of  the  transactions  and  integrity  constraints 
are  known  a  priori,  then,  as  one  would  expect,  it  is  often  possible  to  design  concurrent 
systems  or  algorithms  enjoying  high  degrees  of  concurrency. 

We  assume  as  before  that  each  transaction  under  consideration  is  correct  if  executed 
alone. Here we further assume that some correctness proof for each transaction is available.
Such  a  proof  must  rely  on  and  also  must  reflect  the  meanings  of  the  transactions  and  the 
integrity  constraints  imposed  on  the  database.  Therefore  a  natural  way  to  capture  semantic 
information  is  to  examine  correctness  proofs  of  the  transactions. 

We  consider  proofs  using  assertions  [Floyd  67].  A  transaction  is  represented  as  a 
flowchart  of  operations  which  manipulate  a  set  of  variables.  Executing  the  transaction  is 
viewed  as  moving  a  token  on  the  flowchart  from  the  input  arc  to  an  output  arc.  An  assertion, 
defined  in  terms  of  the  variables,  is  attached  to  each  arc  of  the  flowchart;  in  particular,  the 
assertions on the input and any output arcs are the integrity constraints. A correctness proof of
a serial transaction amounts to demonstrating that throughout the execution of the
transaction the token will always be on an arc whose assertion is true at that time, and will
eventually reach an output arc. The consistency of a database under the concurrent
execution of several correct serial transactions can be ensured by the following scheduling
policy  [Lamport  76]: 

The  request  to  execute  one  step  in  a  transaction  is  granted  only  if  the  execution 
will  not  invalidate  any  of  the  assertions  attached  to  those  arcs  where  the  tokens 
of  other  transactions  reside  at  that  time. 

It is possible that at some time none of the transactions can be granted permission to execute its next
step. This "deadlock" situation can be resolved, for example, by backing up some
transactions.  The  above  scheduling  policy  demonstrates  that  at  least  in  principle  the 
consistency  of  a  concurrent  system  can  be  preserved  by  using  correctness  proofs  of  serial 
transactions.  In  [Lamport  76],  efficient  schedules  are  derived  from  this  scheduling  policy  for 
some  concurrent  systems.  The  schedules  have  the  property  that  they  preserve  consistency 
but  are  not  equivalent  to  serial  schedules. 
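The scheduling policy can be illustrated with the earlier example, in which the integrity constraint is x > 0 and each transaction is "if x > 1 then x ← x - 1". The Python sketch below is only our illustration of the policy, not code from [Lamport 76]; the function `may_execute`, the encoding of assertions as predicates on a state dictionary, and the choice of which transaction to back up are assumptions.

```python
def may_execute(step, state, other_assertions):
    """Grant a step only if every assertion attached to an arc holding
    another transaction's token remains true in the resulting state."""
    new_state = step(dict(state))      # try the step on a copy of the state
    return all(holds(new_state) for holds in other_assertions)


def decrement(state):
    state["x"] -= 1
    return state


def x_positive(state):                 # integrity constraint on input/output arcs
    return state["x"] > 0


def x_above_one(state):                # assertion on the arc after the test succeeds
    return state["x"] > 1


state = {"x": 2}

# Both transactions have passed the test, so the other token sits on the
# arc asserting x > 1. A decrement would invalidate that assertion.
blocked = may_execute(decrement, state, [x_above_one])

# After the other transaction is backed up to its input arc, only x > 0
# has to be preserved, and the decrement can be granted.
granted = may_execute(decrement, state, [x_positive])
```

In this situation neither transaction may decrement while both tokens rest on the x > 1 arc, which is the "deadlock" mentioned above; backing one of them up lets the other proceed, and the final state x = 1 is consistent.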

A similar idea, establishing the correctness of a concurrent system by showing that the
proof of any of its sequential programs cannot be invalidated by the execution of any other
program, has been studied by several people, including [Ashcroft 75, Keller 76, Lamport
77, Owicki 75].

The approach of solving the consistency problem of a concurrent system by utilizing the
correctness  proofs  of  the  serial  transactions  seems  to  be  quite  general  and  powerful.  In  this 
framework,  with  enough  human  ingenuity,  difficult  consistency  problems  (or  their  solutions) 
can  often  be  solved  efficiently  (or  explained  elegantly).  Some  of  the  results  in  Section  4.3 
below  can  in  fact  be  cast  in  this  framework.  Much  work  remains  to  be  done  in  developing 
mechanical  ways  of  using  this  approach  in  designing  concurrent  database  systems. 


4.3  Algorithms  for  Specific  Problems 

When  designing  an  asynchronous  algorithm  for  solving  a  specific  problem,  we  have  control 
over the tasks that will be included in the algorithm. Therefore, it is possible to keep the
required  synchronization  among  the  processes  of  an  algorithm  as  weak  as  possible,  by  a 
careful  design  of  these  processes.  This  would  not  be  possible  for  general  database  systems 
where  a  transaction  has  no  idea  about  other  transactions  it  might  have  to  interact  with.  As  a 
result,  algorithms  in  this  section  enjoy  much  higher  degrees  of  concurrency  than  those 
algorithms  which  are  derived  from  the  general  techniques  in  Section  4.2. 


4.3.1  Concurrent  Accesses  to  Search  Trees 

We discuss how a file organized as a B-tree or a binary search tree can be accessed
simultaneously by a number of processes. The goal is to ensure integrity for each access
while at the same time providing a high degree of concurrency and also avoiding deadlock.

Concurrent Access to B-trees

The  organization  of  B-trees  was  introduced  by  [Bayer  and  McCreight  72]  and  some 
variants  of  it  appear  in  [Knuth  73].  We  assume  that  the  reader  is  familiar  with  the  definition 
of B-trees. Here we mention only that for a B-tree the leaves are all on the same level and
the number of keys contained at each node except the root is between k and 2k for some
positive  integer  k.  The  problem  concerning  multiple  access  to  B-trees  has  been  addressed  in 
a  number  of  papers.  It  appears  that  [Samadi  76]  gave  the  first  published  solution.  In  his 
solution,  exclusive  locks  are  used  by  all  the  processes.  As  a  search  proceeds  down  the  tree, 
it locks the son and unlocks the father until it terminates. On the other hand, an updater (insertion or
deletion)  locks  successive  nodes  as  it  proceeds  down  the  tree,  but  when  a  "safe"  node  is 
encountered,  all  the  ancestors  of  that  node  are  unlocked.  For  the  insertion  (or  deletion)  case 
a  node  is  considered  to  be  "safe"  when  a  key  can  be  inserted  into  (or  removed  from)  that 
node  without  causing  an  overflow  (or  underflow).  It  is  relatively  easy  to  see  that  the  solution 
preserves integrity for each access and is deadlock free.

[Bayer and Schkolnick 77] observe that when k is large the chance that an updater will
cause splits or merges on nodes, especially at top levels of the tree, is small. Therefore, they
propose that an updater should place weak locks such as share locks on nodes at a top
section of the tree, and only later (in the second pass) convert some of these weak locks into
strong locks such as exclusive locks if necessary. They present and prove the correctness of
a general schema, which involves certain parameters that can be tuned to optimize the
performance of the schema. Bayer and Schkolnick's solution is expected to have good
average performance, especially when k is large. In the worst case, however, an updater can
still lock out the entire tree.
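The search protocol attributed above to [Samadi 76], in which a process locks the son before unlocking the father, can be sketched as follows. This Python fragment is a simplified illustration on a multiway search tree, not a B-tree implementation; the `Node` class, the function names, and the use of `threading.Lock` as the exclusive lock are assumptions.

```python
import threading
from bisect import bisect_left


class Node:
    def __init__(self, keys, children=None):
        self.keys = keys                  # sorted keys stored in this node
        self.children = children or []    # empty list for a leaf
        self.lock = threading.Lock()      # exclusive lock on the node


def search(root, key):
    """Descend from the root, locking the son before unlocking the father."""
    node = root
    node.lock.acquire()
    while True:
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            node.lock.release()
            return True
        if not node.children:             # reached a leaf without finding key
            node.lock.release()
            return False
        child = node.children[i]
        child.lock.acquire()              # lock the son ...
        node.lock.release()               # ... then unlock the father
        node = child


def is_safe(node, k, inserting):
    """A node is "safe" if an insertion cannot overflow it (fewer than 2k
    keys) or a deletion cannot underflow it (more than k keys)."""
    if inserting:
        return len(node.keys) < 2 * k
    return len(node.keys) > k


leaves = [Node([1, 5]), Node([12, 15]), Node([25, 30])]
root = Node([10, 20], leaves)
found = search(root, 15)
```

Because every process acquires locks in top-down order and holds at most two at a time during a search, no cycle of waiting processes can form in this sketch.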

Concurrent  Access  to  Binary  Search  Trees 

In  [Kung  and  Lehman  79b]  algorithms  for  a  binary  search  tree  which  can  support 
concurrent  searching,  insertion,  deletion  and  reorganization  (especially,  rebalancing)  on  the 
tree  are  proposed.  In  these  algorithms,  only  writer-exclusion  locks  are  used,  simply  to 
prevent  the  obvious  problems  created  by  simultaneous  updates  of  a  node  by  more  than  one 
process.  Moreover,  in  these  algorithms,  any  process  locks  only  a  small  constant  number  of 
nodes  at  a  given  time,  and  a  searcher  is  not  blocked  at  all  until  possibly  at  the  very  end  of 
the  search  when  it  is  ready  to  return  its  answer.  We  discuss  some  general  techniques  that 
were  used  for  achieving  this  high  degree  of  concurrency. 

Unlike  the  concurrent  solutions  for  B-trees  described  above,  updaters  are  no  longer 
responsible  for  rebalancing.  An  update  just  does  whatever  insertion  or  deletion  it  has  to  do, 
and  postpones  the  work  of  rebalancing  the  (possibly)  unbalanced  structure  caused  by  the 
updating.  Other  processes  can  perform  the  postponed  work  on  separate  processors. 
Through  this  idea  of  postponement,  the  multiprocessing  capability  of  a  multiprocessor 
environment  can  be  utilized.  The  same  idea  is  used  in  garbage  collection.  Rather  than 
performing  the  garbage  collection  itself,  the  deleter  simply  appends  deleted  nodes  to  a  list  of 
nodes  to  be  garbage  collected  later.  In  this  way,  the  deleter  need  not  wait  until  it  is  safe  to 
do  the  garbage  collection  (i.e.  the  time  when  no  one  else  will  access  the  deleted  node),  and 
garbage  collection  can  be  done  by  separate  processors. 

Another  idea  used  by  the  algorithms  is  that  a  process  makes  updates  only  on  a  local  copy 
of  the  relevant  portion  of  the  tree  and  later  introduces  its  copy  into  the  global  tree  in  one 
step. With this technique one can get the effect of making many changes to the database in
one  indivisible  step  without  having  to  lock  a  large  portion  of  the  data.  However,  one  faces 
the  problem  of  backing  up  processes  which  have  read  data  from  old  copies.  It  turns  out  that 
because  of  the  particular  property  of  the  tree  structure,  the  backup  problem  can  be  handled 
efficiently. The copy idea is closely related to the validation method discussed in Section
4.2.1.2.

4.3.2  Asynchronous  Iterative  Algorithms  for  Solving  Numerical  Problems 

Many  numerical  problems  in  practice  are  solved  by  iterative  algorithms.  For  example, 
zeros  of  a  function  f  can  be  approximated  by  the  Newton  iteration, 

*i+l  B  *i  *  nzj)'1  <(*;), 

and  solutions  of  linear  systems  by  iterations  of  the  form, 

*j+l  “  AXj  +  b, 

where  the  Xj,  bj  are  n-vectors  and  A  is  an  nxn  matrix.  In  general,  an  iterative  algorithm  is 
defined  as: 

*i+l  =  *i-l»  •  •  •>  *i-d+l^* 

where  <p  is  some  "iteration  function".  Here  we  are  interested  in  parallel  algorithms  through 
which  an  asynchronous  multiprocessor  can  be  used  efficiently  to  speed  up  the  iterative 
process. We shall follow the terminology introduced in [Kung 76] for various classes of parallel
iterative  algorithms. 
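
For concreteness, the two example iterations can be written sequentially as below. This is a minimal sketch: the function names `newton` and `linear_iteration`, the fixed step counts, and the restriction of Newton's method to the scalar case are assumptions made here for illustration.

```python
def newton(f, fprime, x0, steps=20):
    """Scalar Newton iteration: x_{i+1} = x_i - f(x_i) / f'(x_i)."""
    x = x0
    for _ in range(steps):
        x = x - f(x) / fprime(x)
    return x


def linear_iteration(A, b, x0, steps=100):
    """Iteration x_{i+1} = A x_i + b, with A given as a list of rows."""
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        # Each component of the new iterate depends only on the old iterate,
        # so the n components could be computed independently.
        x = [sum(A[r][c] * x[c] for c in range(n)) + b[r] for r in range(n)]
    return x
```

For example, `newton(lambda x: x * x - 2, lambda x: 2 * x, 1.0)` approximates the square root of 2.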

The iteration function $\varphi$ can typically be evaluated concurrently by a number of independent
processes. For example, for the Newton iteration f and f' can be evaluated concurrently, and
for the matrix iteration all the components of the vector $x_{i+1}$ can be computed
simultaneously.  In  a  straightforward  synchronized  (parallel)  iterative  algorithm,  the 
concurrent  processes  that  evaluate  the  iteration  function  are  synchronized  at  each  iteration 
step,  i.e.,  a  process  is  not  allowed  to  start  computing  a  new  iterate  until  all  the  processes 
have  finished  their  work  for  the  current  iterate.  Thus,  processes  in  a  synchronized  parallel 
algorithm  may  have  to  wait  for  each  other.  It  has  been  observed  that  by  and  large  iterative 
processes  are  insensitive  to  the  ordering  of  evaluation  as  far  as  convergence  is  concerned. 
This  observation  leads  to  the  notion  of  an  asynchronous  (parallel)  iterative  algorithm,  in  which 
processes are not synchronized at all. In particular, an asynchronous iterative algorithm is
obtained by removing the synchronization imposed on a synchronized iterative algorithm.
In a truly asynchronous iterative algorithm, a process keeps computing new
iterates by using whatever information is currently available and immediately releases its
computed results to other processes. Thus, the actual iterates generated by the method
depend  on  the  relative  speeds  of  the  processes.  A  slightly  restricted  form  of  asynchronous 
iterative  algorithms  for  solving  linear  systems  is  known  as  chaotic  relaxation  [Chazan  and 
Miranker  69]  in  the  literature.  G.  Baudet,  in  [Baudet  78a,  Baudet  78b],  reports  the 
experimental  results  from  the  implementation  of  various  parallel  iterative  algorithms  on  C.mmp 
to  solve  the  Dirichlet  problem  for  Laplace's  equation  on  a  rectangular  two-dimensional  region. 
His results indicate clearly that on C.mmp asynchronous iterative methods are superior to their
synchronized counterparts with respect to overall computation times. For a concise survey
of parallel methods for solving equations the reader is referred to [Miranker 77].
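
The effect of dropping synchronization can be imitated in a single thread, as sketched below. This is a simulation under stated assumptions, not Baudet's implementation: the function name `asynchronous_updates` is hypothetical, a seeded pseudo-random order stands in for the unpredictable relative speeds of the processes, and the matrix A is assumed to have absolute row sums below 1 so that the iteration converges.

```python
import random


def asynchronous_updates(A, b, x0, updates, seed=0):
    """Update one component at a time, in arbitrary order, for x = A x + b.

    Each update reads whatever values are currently in x and overwrites
    its component at once, so there is no notion of a common iterate.
    """
    rng = random.Random(seed)
    x = list(x0)
    n = len(x)
    for _ in range(updates):
        r = rng.randrange(n)  # whichever "process" happens to finish next
        x[r] = sum(A[r][c] * x[c] for c in range(n)) + b[r]
    return x


# Small example whose exact solution is (1, 1, 1).
A = [[0.0, 0.4, 0.1],
     [0.3, 0.0, 0.2],
     [0.2, 0.2, 0.0]]
b = [0.5, 0.5, 0.6]
x = asynchronous_updates(A, b, [0.0, 0.0, 0.0], updates=2000)
```

Changing `seed` changes the sequence of iterates, which mirrors the dependence on relative process speeds, but under the stated assumption on A the limit is the same.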

4.3.3 Concurrent Database Reorganization

In  many  database  organizations,  the  performance  for  accesses  will  gradually  deteriorate 
due to structural changes caused by insertions and deletions. By reorganizing the database,
the access costs can be reduced. Garbage collection in classical Lisp implementations can
also  be  viewed  as  a  database  reorganization.  In  such  an  implementation  when  the  free  list  is 
exhausted,  the  list  processor  is  suspended  and  the  garbage  collector  is  invoked  to  find  nodes 
which  are  no  longer  in  use  (garbage  nodes)  and  append  them  to  the  free  list.  Database 
reorganizations  are  typically  very  time-consuming.  Thus,  it  is  desirable  to  reorganize  a 
database concurrently without having to block the usual accesses to the database.
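
The classical, non-concurrent scheme described above can be sketched as a mark phase followed by a sweep phase. This is a generic illustration rather than the code of any particular Lisp system: `Cell`, `collect`, and `free_list` are hypothetical names, and the list processor is assumed to be suspended while `collect` runs.

```python
class Cell:
    def __init__(self, name):
        self.name = name
        self.refs = []       # outgoing pointers of this node
        self.marked = False


def collect(heap, roots, free_list):
    """Mark every node reachable from the roots, then sweep the rest."""
    stack = list(roots)
    while stack:                     # mark phase
        cell = stack.pop()
        if not cell.marked:
            cell.marked = True
            stack.extend(cell.refs)
    for cell in heap:                # sweep phase
        if cell.marked:
            cell.marked = False      # reset for the next collection
        elif cell not in free_list:
            cell.refs.clear()        # garbage node: return it to the free list
            free_list.append(cell)
```

Both phases visit a large part of the heap, which is why suspending the list processor for the whole collection is expensive.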

Recently there has been considerable interest in concurrent garbage collection. The goal is
to collect garbage concurrently with the operations of the list processor. The first published
solution is due to [Steele 75], which uses a semaphore-type synchronization mechanism.
[Dijkstra et al. 78] gave a solution whose synchronization is kept as weak as possible, but
made no claim on the efficiency of the solution. [Kung and Song 77] gave an efficient solution
by using very weak synchronization. These solutions are extremely subtle. We refer the
reader  to  the  original  papers  for  descriptions  of  these  solutions.  Here  we  just  discuss  some 
experience we gained from the concurrent garbage collection problem. Contrary to what one
might expect, concurrent garbage collection does not automatically mean that the list
processor is suspended less often and so, on average, does more computation in a fixed
time period. For correctness reasons, it is necessary that some
synchronization  overheads  be  introduced  to  the  list  processor,  and  consequently  the  list 
processor  is  slowed  down.  Also,  it  is  inevitable  that  the  garbage  collector  will  sometimes 
perform  useless  work.  For  example,  the  garbage  collector  can  be  marking  a  set  of  nodes 
without  knowing  that  their  ancestors  have  just  been  made  into  garbage  by  the  list  processor. 
All of this affects the effectiveness of the parallel garbage collection. Similar types of
performance degradation are expected in other instances of concurrent database
reorganization. The central question is how to make the reorganization process effective
without incurring excessive synchronization costs. The problem can be extremely
challenging, as we have experienced in the concurrent garbage collection case. This may
explain the scarcity of results available on concurrent database reorganization today.

Memory  reorganization  is  just  one  of  the  many  "housekeeping  activities"  performed 
regularly  in  any  large-scale  computer  system.  Ideally,  these  system  activities  should  all  be 
carried out by additional processors operating concurrently with the processors directly
devoted to the users' computations. The system should constantly reorganize itself to
improve its service to the user. The user at his end simply sees a more efficient system
providing rapid responses. A feature of this approach, which is highly desirable for practical
reasons, is that the speed-up can be achieved without requiring the users to rewrite their
codes. It seems that concurrent reorganization (or housekeeping) represents one of the most
attractive applications that asynchronous multiprocessors are capable of supporting. We
expect that significant progress along this line will be made in the near future, as
multiprocessors  become  prevalent. 

4.4  Remarks  for  Section  4 

All  the  efficient  algorithms  mentioned  in  Section  4.3  share  a  common  property,  namely,  a 
process in an algorithm is never made to wait for other processes to complete their tasks.
The same philosophy is used in validation methods in Section 4.2.1.2, in task scheduling
[Baudet et al. 77], and in several other examples [Kung 76, Robinson 79]. This suggests that
this "never-wait" principle is a useful criterion to follow in designing efficient algorithms for
asynchronous  multiprocessors.  A  typical  way  to  achieve  this  goal  is  to  use  copies.  After  a 
process  completes  its  current  task,  it  immediately  starts  working  on  a  copy  of  the  most 
recent  global  data.  Of  course,  validation  is  needed  later  on  to  determine  whether  or  not  the 
updated  copy  can  be  made  global.  Validation  is  not  necessarily  costly  when  it  can  be  carried 
out  in  parallel  on  separate  processors.  Another  technique  to  achieve  the  "never-wait"  goal  is 
the  postponement  idea  as  used  in  the  concurrent  binary  search  algorithm:  a  process  simply 
ignores  for  the  time  being  any  work  it  is  not  allowed  to  perform  immediately,  but  comes  back 
to  perform  the  work  at  a  later  time. 
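
The copy-then-validate route to the "never-wait" goal might look like the sketch below. It is an illustration only: `VersionedStore`, `snapshot`, `try_install`, and `update` are hypothetical names, a version counter is assumed as the validation test, and a lock is held just for the brief validate-and-install step rather than for the computation itself.

```python
import threading


class VersionedStore:
    def __init__(self, value):
        self._value = value
        self._version = 0
        self._lock = threading.Lock()  # guards only the short steps below

    def snapshot(self):
        with self._lock:
            return self._version, self._value

    def try_install(self, seen_version, new_value):
        # Validation: the copy may become global only if nobody else
        # has installed a newer value since the snapshot was taken.
        with self._lock:
            if self._version != seen_version:
                return False
            self._value = new_value
            self._version += 1
            return True


def update(store, compute):
    while True:
        version, value = store.snapshot()
        candidate = compute(value)     # work on a private copy, no lock held
        if store.try_install(version, candidate):
            return candidate
        # Validation failed: start again from the most recent global data.
```

A process whose validation fails does not sit idle; it immediately redoes its task on fresh data, so the price of never waiting is occasional repeated work.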


5.  Concluding  Remarks 


One  can  see  from  the  preceding  sections  that  issues  concerning  algorithms  for  synchronous 
parallel  computers  are  quite  different  from  those  for  asynchronous  parallel  computers. 

For synchronous parallel computers, one is concerned with algorithms defined on networks.
Task  modules  of  an  algorithm  are  simply  computations  associated  with  nodes  of  the 
underlying  network.  Communication  geometry  and  data  movement  are  a  major  part  of  an 
algorithm.  For  chip  implementation  it  is  essential  that  the  communication  geometry  be  simple 
and regular, and that silicon area rather than the number of gates alone be taken into
consideration.  One  of  the  important  research  topics  in  this  area  is  the  development  of  a  new 
theory  of  algorithms  that  addresses  issues  regarding  communication  geometry  and  data 
movement.  In  particular,  it  would  be  extremely  useful  to  have  a  good  notation  for  expressing 
and  verifying  algorithms  defined  on  networks,  and  to  have  a  good  complexity  model  for 
computations  on  silicon  chips.  Some  initial  steps  along  these  directions  have  been  taken  by 
[Cohen  78,  Brent  and  Kung  79b,  Thompson  79a,  Thompson  79b]. 

For  asynchronous  parallel  computers,  one  is  concerned  with  parallel  algorithms  whose  task 
modules  are  executed  by  asynchronous  processes.  The  major  issues  are  the  correctness  and 
efficiency  of  an  algorithm  in  the  presence  of  the  asynchronous  behavior  of  its  processes.  For 
the general database environment where only syntactic information can be used, the
serialization  approach  is  the  method  for  ensuring  correctness.  Serialization  can  be  achieved 
by  either  locking  or  transaction  backup.  If  semantic  information  about  integrity  constraints 
and  transactions  is  available  as  in  many  special  problem  instances,  then  more  efficient 
algorithms  that  support  higher  degrees  of  concurrency  may  be  designed.  Efficiency  analysis 
of  algorithms  for  asynchronous  computers  is  usually  difficult,  since  execution  times  are 
random  variables  rather  than  constants.  Typically,  techniques  in  order  statistics  and  queueing 
models  have  to  be  employed  (see,  e.g.,  [Robinson  79]).  Generally  speaking,  algorithms  with 
large  module  granularity  are  well  suited  to  asynchronous  multiprocessors.  In  this  case,  a 
process  can  proceed  for  a  long  period  of  time  before  it  has  to  wait  for  input  from  other 
processes. Many database applications fall into this category. Further and more detailed
discussions  on  the  programming  issues  raised  by  asynchronous  multiprocessors  can  be  found 
in  [Newell  and  Robertson  75,  Jones  et  al.  73,  Jones  and  Schwarz  78]. 
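
A small example suggests why order statistics enter the efficiency analysis. Assume, purely for illustration, that k processes are synchronized at each step and that their execution times are independent and exponentially distributed with mean 1. The step then lasts as long as the slowest process, and the expected maximum of k such times is the harmonic number H_k = 1 + 1/2 + ... + 1/k. The function names below are hypothetical.

```python
import random


def harmonic(k):
    """H_k, the expected maximum of k independent Exp(1) times."""
    return sum(1.0 / j for j in range(1, k + 1))


def mean_step_time(k, trials=20000, seed=1):
    """Estimate by simulation the mean time of a step that waits for all k."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.expovariate(1.0) for _ in range(k))
    return total / trials
```

Under these assumptions each process does one unit of work per step on average, yet a synchronized step with k = 8 takes about 2.7 units, so much of the time is spent waiting. The penalty grows with k, which is one reason large module granularity and weak synchronization are attractive.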